1 Load & Install Packages

if (!require("nnet")) install.packages("nnet")
## Loading required package: nnet
if (!require("MASS")) install.packages("MASS")
## Loading required package: MASS
if (!require("e1071")) install.packages("e1071")
## Loading required package: e1071
if (!require("class")) install.packages("class")
## Loading required package: class
if (!require("leaps")) install.packages("leaps")
## Loading required package: leaps
if (!require("glmnet")) install.packages("glmnet")
## Loading required package: glmnet
## Loading required package: Matrix
## Loaded glmnet 4.1-8
if (!require("car")) install.packages("car")
## Loading required package: car
## Loading required package: carData
if (!require("caTools")) install.packages("caTools")
## Loading required package: caTools
if (!require("mgcv")) install.packages("mgcv")
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.9-0. For overview type 'help("mgcv-package")'.
## 
## Attaching package: 'mgcv'
## The following object is masked from 'package:nnet':
## 
##     multinom
if (!require("summarytools")) install.packages("summarytools")
## Loading required package: summarytools
if (!require("dplyr")) install.packages("dplyr")
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:nlme':
## 
##     collapse
## The following object is masked from 'package:car':
## 
##     recode
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
if (!require("ggplot2")) install.packages("ggplot2")
## Loading required package: ggplot2
if (!require("tidyverse")) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.5     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::collapse() masks nlme::collapse()
## ✖ tidyr::expand()   masks Matrix::expand()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ tidyr::pack()     masks Matrix::pack()
## ✖ dplyr::recode()   masks car::recode()
## ✖ dplyr::select()   masks MASS::select()
## ✖ purrr::some()     masks car::some()
## ✖ tidyr::unpack()   masks Matrix::unpack()
## ✖ tibble::view()    masks summarytools::view()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
if (!require("lubridate")) install.packages("lubridate")
if (!require("mapview")) install.packages("mapview")
## Loading required package: mapview
if (!require("sf")) install.packages("sf")
## Loading required package: sf
## Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
if (!require("geojsonio")) install.packages("geojsonio")
## Loading required package: geojsonio
## Registered S3 method overwritten by 'geojsonsf':
##   method        from   
##   print.geojson geojson
## 
## Attaching package: 'geojsonio'
## 
## The following object is masked from 'package:base':
## 
##     pretty
if (!require("leaflet")) install.packages("leaflet")
## Loading required package: leaflet
if (!require("broom")) install.packages("broom")
## Loading required package: broom
if (!require("plotly")) install.packages("plotly")
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:MASS':
## 
##     select
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
if (!require("gridExtra")) install.packages("gridExtra")
## Loading required package: gridExtra
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(nnet)
library(MASS)
library(e1071)
library(class)
library(leaps)
library(glmnet)
library(car)
library(caTools)
library(mgcv)

library(summarytools)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(lubridate)
library(mapview)
library(sf)
library(geojsonio)
library(leaflet) 
library(broom)
library(plotly)
library(gridExtra)
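
The install-then-load pattern above repeats the same two steps for every package. As a minimal sketch, the same behaviour could be wrapped in a small helper. The name load_or_install is hypothetical and not part of the original script, and the demonstration call uses only packages that ship with R so that nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then attach it.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    # requireNamespace() checks availability without attaching the package
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Demonstration with packages bundled with R; the project's own list
# (nnet, MASS, e1071, ...) could be passed in the same way.
load_or_install(c("stats", "utils"))
```

Using requireNamespace() for the check keeps the test free of side effects, and library() then attaches each package exactly once.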

2 Dataset description

The Fire Incident Dispatch Data file contains data that is generated by the Starfire Computer Aided Dispatch System. The data spans from the time the incident is created in the system to the time the incident is closed in the system. It covers information about the incident as it relates to the assignment of resources and the Fire Department’s response to the emergency. To protect personal identifying information in accordance with the Health Insurance Portability and Accountability Act (HIPAA), specific locations of incidents are not included and have been aggregated to a higher level of detail.

In this analysis we restrict attention to the last 50,000 observations, which span 5 to 30 September 2023.

  1. STARFIRE_INCIDENT_ID: An incident identifier comprising the 5-character Julian date, 4-character alarm box number, 2-character number of incidents at the box so far for the day, 1-character borough code, and 4-character sequence number.

  2. INCIDENT_DATETIME: The date and time of the incident.

  3. ALARM_BOX_BOROUGH: The borough of the alarm box.

  4. ALARM_BOX_LOCATION: The location of the alarm box.

  5. ALARM_BOX_NUMBER: The alarm box number.

  6. INCIDENT_BOROUGH: The borough of the incident.

  7. ZIPCODE: The zip code of the incident.

  8. POLICEPRECINCT: The police precinct of the incident.

  9. CITYCOUNCILDISTRICT: The city council district.

  10. COMMUNITYDISTRICT: The community district.

  11. COMMUNITYSCHOOLDISTRICT: The community school district.

  12. CONGRESSIONALDISTRICT: The congressional district.

  13. ALARM_SOURCE_DESCRIPTION_TX: The description of the alarm source.

  14. ALARM_LEVEL_INDEX_DESCRIPTION: The alarm level index.

  15. HIGHEST_ALARM_LEVEL: The highest alarm level.

  16. INCIDENT_CLASSIFICATION: The incident classification.

  17. INCIDENT_CLASSIFICATION_GROUP: The incident classification roll up group.

  18. FIRST_ASSIGNMENT_DATETIME: The date and time of the first unit assignment.

  19. FIRST_ACTIVATION_DATETIME: The date and time of the first unit acknowledgement of the assignment.

  20. FIRST_ON_SCENE_DATETIME: The date and time of the first unit at the scene of the incident.

  21. INCIDENT_CLOSE_DATETIME: The date and time that the incident was closed in the dispatch system.

  22. VALID_DISPATCH_RSPNS_TIME_INDC: Indicates that the components comprising the generation of the DISPATCH_RESPONSE_SECONDS_QY are valid.

  23. DISPATCH_RESPONSE_SECONDS_QY: The elapsed time in seconds between the incident_datetime and the first_assignment_datetime.

  24. VALID_INCIDENT_RSPNS_TIME_INDC: Indicates that the components comprising the generation of the INCIDENT_RESPONSE_SECONDS_QY are valid.

  25. INCIDENT_RESPONSE_SECONDS_QY: The elapsed time in seconds between the incident_datetime and the first_onscene_datetime.

  26. INCIDENT_TRAVEL_TM_SECONDS_QY: The elapsed time in seconds between the first_assignment_datetime and the first_onscene_datetime.

  27. ENGINES_ASSIGNED_QUANTITY: The number of engine units assigned to the incident.

  28. LADDERS_ASSIGNED_QUANTITY: The number of ladder units assigned to the incident.

  29. OTHER_UNITS_ASSIGNED_QUANTITY: The number of units that are not engines or ladders that were assigned to the incident.

Regarding the response, we carry out two analyses: one aims to predict INCIDENT_RESPONSE_SECONDS_QY, the other EMERGENCY_TIME, defined as the time difference between FIRST_ON_SCENE_DATETIME and INCIDENT_CLOSE_DATETIME. Both analyses start from a linear regression model; however, we will see that the assumptions of linear regression are not met, so we simplify the problem by moving to classification, dividing each response into two or more ranges.
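
To make the move from regression to classification concrete, the sketch below turns a continuous response into a two-class factor. The response values are made up, and the cut point (the sample median) is only an illustrative assumption, not necessarily the threshold used later in the analysis.

```r
# Toy response times in minutes (made-up values, not taken from the dataset)
response_min <- c(2.1, 4.4, 5.6, 6.3, 7.1, 9.8, 15.0)

# Hypothetical cut point: the sample median
threshold <- median(response_min)

# Two-class version of the response
response_class <- factor(ifelse(response_min <= threshold, "Fast", "Slow"),
                         levels = c("Fast", "Slow"))
table(response_class)
```

A median split gives roughly balanced classes; more than two ranges could be obtained in the same spirit with cut() and a vector of break points.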

3 Data Exploration and Cleaning

The first step is always to read the dataset and print its first few observations.

fire_data <- read.csv("datasets/Fire_Incident_Dispatch_Data_last_50k.csv")

head(fire_data)
##    STARFIRE_INCIDENT_ID      INCIDENT_DATETIME ALARM_BOX_BOROUGH
## 1 230905-B1937-001-0567 09/05/2023 02:19:04 PM          BROOKLYN
## 2 230905-B3923-002-0568 09/05/2023 02:19:36 PM          BROOKLYN
## 3 230905-X8897-003-0480 09/05/2023 02:19:43 PM             BRONX
## 4 230905-X3466-001-0481 09/05/2023 02:21:00 PM             BRONX
## 5 230905-B2448-001-0570 09/05/2023 02:21:26 PM          BROOKLYN
## 6 230905-B2448-002-0571 09/05/2023 02:22:35 PM          BROOKLYN
##   ALARM_BOX_NUMBER                    ALARM_BOX_LOCATION INCIDENT_BOROUGH
## 1             1937                AUTUMN AVE & FULTON ST         BROOKLYN
## 2             3923          N/S EASTERN PWAY & UTICA AVE         BROOKLYN
## 3             8897 CROSS BX EXPY- DEEGAN EX TO JEROME AV            BRONX
## 4             3466                  ADEE AVE & BX PARK E            BRONX
## 5             2448             GLENWOOD RD & BEDFORD AVE         BROOKLYN
## 6             2448             GLENWOOD RD & BEDFORD AVE         BROOKLYN
##   ZIPCODE POLICEPRECINCT CITYCOUNCILDISTRICT COMMUNITYDISTRICT
## 1   11208             75                  37               305
## 2   11213             71                  35               309
## 3      NA             NA                  NA                NA
## 4   10467             49                  12               211
## 5   11210             70                  45               314
## 6   11210             70                  45               314
##   COMMUNITYSCHOOLDISTRICT CONGRESSIONALDISTRICT ALARM_SOURCE_DESCRIPTION_TX
## 1                      19                     7                         EMS
## 2                      17                     9                     CLASS-3
## 3                      NA                    NA                     EMS-911
## 4                      11                    15                         EMS
## 5                      22                     9                         EMS
## 6                      22                     9                         EMS
##   ALARM_LEVEL_INDEX_DESCRIPTION HIGHEST_ALARM_LEVEL
## 1                 Initial Alarm         First Alarm
## 2                 Initial Alarm         First Alarm
## 3                 Initial Alarm         First Alarm
## 4                DEFAULT RECORD         First Alarm
## 5                DEFAULT RECORD         First Alarm
## 6                DEFAULT RECORD         First Alarm
##                  INCIDENT_CLASSIFICATION INCIDENT_CLASSIFICATION_GROUP
## 1 Medical - No PT Contact EMS is Onscene           Medical Emergencies
## 2                          Hospital Fire              Structural Fires
## 3               Vehicle Accident - Other        NonMedical Emergencies
## 4               Medical - EMS Link 10-91           Medical Emergencies
## 5               Medical - EMS Link 10-91           Medical Emergencies
## 6               Medical - EMS Link 10-91           Medical Emergencies
##   DISPATCH_RESPONSE_SECONDS_QY FIRST_ASSIGNMENT_DATETIME
## 1                            7    09/05/2023 02:19:12 PM
## 2                           95    09/05/2023 02:21:11 PM
## 3                           41    09/05/2023 02:20:25 PM
## 4                          298    09/05/2023 02:25:59 PM
## 5                           25    09/05/2023 02:21:52 PM
## 6                          350    09/05/2023 02:28:25 PM
##   FIRST_ACTIVATION_DATETIME FIRST_ON_SCENE_DATETIME INCIDENT_CLOSE_DATETIME
## 1    09/05/2023 02:19:26 PM  09/05/2023 02:25:23 PM  09/05/2023 03:03:15 PM
## 2    09/05/2023 02:21:33 PM  09/05/2023 02:23:21 PM  09/05/2023 02:34:18 PM
## 3    09/05/2023 02:20:35 PM  09/05/2023 02:26:22 PM  09/05/2023 04:13:32 PM
## 4    09/05/2023 02:26:04 PM                          09/05/2023 02:34:23 PM
## 5    09/05/2023 02:22:08 PM                          09/05/2023 02:28:07 PM
## 6                                                    09/05/2023 02:29:09 PM
##   VALID_DISPATCH_RSPNS_TIME_INDC VALID_INCIDENT_RSPNS_TIME_INDC
## 1                              N                              Y
## 2                              N                              Y
## 3                              N                              Y
## 4                              N                              N
## 5                              N                              N
## 6                              N                              N
##   INCIDENT_RESPONSE_SECONDS_QY INCIDENT_TRAVEL_TM_SECONDS_QY
## 1                          378                           371
## 2                          224                           129
## 3                          398                           357
## 4                           NA                            NA
## 5                           NA                            NA
## 6                           NA                            NA
##   ENGINES_ASSIGNED_QUANTITY LADDERS_ASSIGNED_QUANTITY
## 1                         1                         0
## 2                         3                         2
## 3                         2                         3
## 4                         1                         0
## 5                         1                         0
## 6                         1                         0
##   OTHER_UNITS_ASSIGNED_QUANTITY
## 1                             0
## 2                             1
## 3                             1
## 4                             0
## 5                             0
## 6                             0

We use dfSummary from summarytools to get a complete and clear summary of the dataset.

print(dfSummary(fire_data, 
                plain.ascii  = FALSE, 
                style        = "multiline", 
                headings     = FALSE,
                graph.magnif = 0.8, 
                valid.col    = FALSE),
                method = 'render')
No  Variable [type]
      Stats / Values                    Freqs (% of Valid)
      Missing

1   STARFIRE_INCIDENT_ID [character]
      1. 230905-B0042-001-1051          1 (0.0%)
      2. 230905-B0053-001-0760          1 (0.0%)
      3. 230905-B0053-002-0910          1 (0.0%)
      4. 230905-B0081-001-1137          1 (0.0%)
      5. 230905-B0106-002-0632          1 (0.0%)
      6. 230905-B0132-001-0713          1 (0.0%)
      7. 230905-B0147-001-0967          1 (0.0%)
      8. 230905-B0160-001-1125          1 (0.0%)
      9. 230905-B0163-001-1026          1 (0.0%)
      10. 230905-B0165-001-0778         1 (0.0%)
      [ 49990 others ]                  49990 (100.0%)
      Missing: 0 (0.0%)

2   INCIDENT_DATETIME [character]
      1. 09/07/2023 03:53:19 PM         3 (0.0%)
      2. 09/11/2023 09:44:33 AM         3 (0.0%)
      3. 09/13/2023 12:09:35 AM         3 (0.0%)
      4. 09/29/2023 09:44:26 AM         3 (0.0%)
      5. 09/05/2023 03:30:51 PM         2 (0.0%)
      6. 09/05/2023 03:37:48 PM         2 (0.0%)
      7. 09/05/2023 03:53:11 PM         2 (0.0%)
      8. 09/05/2023 04:01:29 PM         2 (0.0%)
      9. 09/05/2023 04:32:57 PM         2 (0.0%)
      10. 09/05/2023 04:59:57 PM        2 (0.0%)
      [ 49364 others ]                  49976 (100.0%)
      Missing: 0 (0.0%)

3   ALARM_BOX_BOROUGH [character]
      1. BRONX                          10973 (21.9%)
      2. BROOKLYN                       13980 (28.0%)
      3. MANHATTAN                      12890 (25.8%)
      4. QUEENS                         9879 (19.8%)
      5. RICHMOND / STATEN ISLAND       2278 (4.6%)
      Missing: 0 (0.0%)

4   ALARM_BOX_NUMBER [integer]
      Mean (sd): 2930.3 (2446.5); min ≤ med ≤ max: 10 ≤ 2275 ≤ 9933; IQR (CV): 2772 (0.8); 7411 distinct values
      Missing: 0 (0.0%)

5   ALARM_BOX_LOCATION [character]
      1. 8 AVE & W 155 ST               85 (0.2%)
      2. 10 RICHMAN PLZ/SEDGWICK A      75 (0.1%)
      3. AMSTERDAM AVE & LA SALLE       50 (0.1%)
      4. 3 AVE & E 143 ST               48 (0.1%)
      5. WASHINGTON AVE & E 170 ST      48 (0.1%)
      6. FDR DR & E 6 ST                45 (0.1%)
      7. CONCOURSE VILLAGE E & E 1      44 (0.1%)
      8. PARK AVE & E 158 ST            40 (0.1%)
      9. UNION TPK & WINCHESTER BL      40 (0.1%)
      10. 8 AVE & W 33 ST               39 (0.1%)
      [ 12203 others ]                  49486 (99.0%)
      Missing: 0 (0.0%)

6   INCIDENT_BOROUGH [character]
      1. BRONX                          10973 (21.9%)
      2. BROOKLYN                       13980 (28.0%)
      3. MANHATTAN                      12890 (25.8%)
      4. QUEENS                         9879 (19.8%)
      5. RICHMOND / STATEN ISLAND       2278 (4.6%)
      Missing: 0 (0.0%)

7   ZIPCODE [integer]
      Mean (sd): 10737.9 (551.8); min ≤ med ≤ max: 10000 ≤ 10472 ≤ 11697; IQR (CV): 1098 (0.1); 217 distinct values
      Missing: 3181 (6.4%)

8   POLICEPRECINCT [integer]
      Mean (sd): 62.3 (34.8); min ≤ med ≤ max: 1 ≤ 61 ≤ 123; IQR (CV): 56 (0.6); 77 distinct values
      Missing: 3180 (6.4%)

9   CITYCOUNCILDISTRICT [integer]
      Mean (sd): 23.1 (15.1); min ≤ med ≤ max: 1 ≤ 21 ≤ 51; IQR (CV): 27 (0.7); 51 distinct values
      Missing: 3180 (6.4%)

10  COMMUNITYDISTRICT [integer]
      Mean (sd): 262.9 (119.4); min ≤ med ≤ max: 101 ≤ 302 ≤ 595; IQR (CV): 206 (0.5); 70 distinct values
      Missing: 3180 (6.4%)

11  COMMUNITYSCHOOLDISTRICT [integer]
      Mean (sd): 14.8 (9.7); min ≤ med ≤ max: 1 ≤ 13 ≤ 32; IQR (CV): 18 (0.7); 32 distinct values
      Missing: 3182 (6.4%)

12  CONGRESSIONALDISTRICT [integer]
      Mean (sd): 10.4 (3.3); min ≤ med ≤ max: 3 ≤ 11 ≤ 16; IQR (CV): 5 (0.3); 13 distinct values
      Missing: 3180 (6.4%)

13  ALARM_SOURCE_DESCRIPTION_TX [character]
      1. 911                            302 (0.6%)
      2. 911TEXT                        14 (0.0%)
      3. BARS                           1 (0.0%)
      4. CLASS-3                        5025 (10.1%)
      5. EMS                            17178 (34.4%)
      6. EMS-911                        10520 (21.0%)
      7. ERS                            777 (1.6%)
      8. ERS-NC                         1 (0.0%)
      9. PHONE                          15146 (30.3%)
      10. SOL                           5 (0.0%)
      11. VERBAL                        1031 (2.1%)
      Missing: 0 (0.0%)

14  ALARM_LEVEL_INDEX_DESCRIPTION [character]
      1. 10-75 Signal (Request for      13 (0.0%)
      2. 10-76 & 10-77 Signal (Not      3 (0.0%)
      3. 7-5 (All Hands Alarm)          100 (0.2%)
      4. DEFAULT RECORD                 17313 (34.6%)
      5. Initial Alarm                  32562 (65.1%)
      6. Second Alarm                   8 (0.0%)
      7. Third Alarm                    1 (0.0%)
      Missing: 0 (0.0%)

15  HIGHEST_ALARM_LEVEL [character]
      1. All Hands Working              100 (0.2%)
      2. First Alarm                    49891 (99.8%)
      3. Second Alarm                   8 (0.0%)
      4. Third Alarm                    1 (0.0%)
      Missing: 0 (0.0%)

16  INCIDENT_CLASSIFICATION [character]
      1. Medical - EMS Link 10-91       9509 (19.0%)
      2. Medical - PD Link 10-91        5741 (11.5%)
      3. Medical - Breathing / Ill      5453 (10.9%)
      4. Medical - No PT Contact E      5013 (10.0%)
      5. Assist Civilian - Non-Med      4140 (8.3%)
      6. Alarm System - Unnecessar      2845 (5.7%)
      7. Elevator Emergency - Occu      1954 (3.9%)
      8. Vehicle Accident - Other       1543 (3.1%)
      9. Utility Emergency - Gas        1359 (2.7%)
      10. Odor - Other Than Smoke       1337 (2.7%)
      [ 57 others ]                     11106 (22.2%)
      Missing: 0 (0.0%)

17  INCIDENT_CLASSIFICATION_GROUP [character]
      1. Medical Emergencies            26824 (53.6%)
      2. Medical MFAs                   208 (0.4%)
      3. NonMedical Emergencies         19072 (38.1%)
      4. NonMedical MFAs                1680 (3.4%)
      5. NonStructural Fires            703 (1.4%)
      6. Structural Fires               1513 (3.0%)
      Missing: 0 (0.0%)

18  DISPATCH_RESPONSE_SECONDS_QY [integer]
      Mean (sd): 40 (133.1); min ≤ med ≤ max: 2 ≤ 19 ≤ 9023; IQR (CV): 33 (3.3); 841 distinct values
      Missing: 0 (0.0%)

19  FIRST_ASSIGNMENT_DATETIME [character]
      1. 09/06/2023 01:40:49 PM         3 (0.0%)
      2. 09/08/2023 02:43:53 PM         3 (0.0%)
      3. 09/20/2023 01:09:29 PM         3 (0.0%)
      4. 09/27/2023 02:04:41 PM         3 (0.0%)
      5. 09/05/2023 02:34:37 PM         2 (0.0%)
      6. 09/05/2023 03:38:43 PM         2 (0.0%)
      7. 09/05/2023 03:48:54 PM         2 (0.0%)
      8. 09/05/2023 03:56:07 PM         2 (0.0%)
      9. 09/05/2023 05:01:22 PM         2 (0.0%)
      10. 09/05/2023 07:13:08 PM        2 (0.0%)
      [ 49499 others ]                  49976 (100.0%)
      Missing: 0 (0.0%)

20  FIRST_ACTIVATION_DATETIME [character]
      1. (Empty string)                 139 (0.3%)
      2. 09/22/2023 02:07:47 PM         4 (0.0%)
      3. 09/07/2023 06:59:12 PM         3 (0.0%)
      4. 09/10/2023 06:28:10 PM         3 (0.0%)
      5. 09/17/2023 07:27:03 PM         3 (0.0%)
      6. 09/23/2023 08:01:29 PM         3 (0.0%)
      7. 09/25/2023 10:17:28 AM         3 (0.0%)
      8. 09/29/2023 08:16:05 AM         3 (0.0%)
      9. 09/05/2023 02:47:25 PM         2 (0.0%)
      10. 09/05/2023 03:00:12 PM        2 (0.0%)
      [ 49196 others ]                  49835 (99.7%)
      Missing: 0 (0.0%)

21  FIRST_ON_SCENE_DATETIME [character]
      1. (Empty string)                 14112 (28.2%)
      2. 09/30/2023 04:01:43 PM         3 (0.0%)
      3. 09/05/2023 03:17:20 PM         2 (0.0%)
      4. 09/05/2023 03:27:35 PM         2 (0.0%)
      5. 09/05/2023 04:37:22 PM         2 (0.0%)
      6. 09/05/2023 04:38:47 PM         2 (0.0%)
      7. 09/05/2023 04:39:30 PM         2 (0.0%)
      8. 09/05/2023 05:44:27 PM         2 (0.0%)
      9. 09/05/2023 05:55:56 PM         2 (0.0%)
      10. 09/05/2023 08:59:49 PM        2 (0.0%)
      [ 35543 others ]                  35869 (71.7%)
      Missing: 0 (0.0%)

22  INCIDENT_CLOSE_DATETIME [character]
      1. 09/05/2023 06:13:06 PM         3 (0.0%)
      2. 09/10/2023 02:16:37 PM         3 (0.0%)
      3. 09/24/2023 04:10:06 PM         3 (0.0%)
      4. 09/25/2023 12:20:57 AM         3 (0.0%)
      5. 09/27/2023 04:38:25 PM         3 (0.0%)
      6. 09/29/2023 10:42:38 AM         3 (0.0%)
      7. 09/30/2023 06:06:40 PM         3 (0.0%)
      8. 09/05/2023 03:25:13 PM         2 (0.0%)
      9. 09/05/2023 04:08:06 PM         2 (0.0%)
      10. 09/05/2023 05:08:09 PM        2 (0.0%)
      [ 49399 others ]                  49973 (99.9%)
      Missing: 0 (0.0%)

23  VALID_DISPATCH_RSPNS_TIME_INDC [character]
      1. N                              50000 (100.0%)
      Missing: 0 (0.0%)

24  VALID_INCIDENT_RSPNS_TIME_INDC [character]
      1. N                              17036 (34.1%)
      2. Y                              32964 (65.9%)
      Missing: 0 (0.0%)

25  INCIDENT_RESPONSE_SECONDS_QY [integer]
      Mean (sd): 380.7 (233.2); min ≤ med ≤ max: 18 ≤ 334 ≤ 7130; IQR (CV): 161 (0.6); 1496 distinct values
      Missing: 14112 (28.2%)

26  INCIDENT_TRAVEL_TM_SECONDS_QY [integer]
      Mean (sd): 340.5 (208.6); min ≤ med ≤ max: 0 ≤ 301 ≤ 7122; IQR (CV): 159 (0.6); 1382 distinct values
      Missing: 14112 (28.2%)

27  ENGINES_ASSIGNED_QUANTITY [integer]
      Mean (sd): 1.1 (0.8); min ≤ med ≤ max: 0 ≤ 1 ≤ 19; IQR (CV): 0 (0.7); 15 distinct values
      Missing: 62 (0.1%)

28  LADDERS_ASSIGNED_QUANTITY [integer]
      Mean (sd): 0.6 (0.8); min ≤ med ≤ max: 0 ≤ 0 ≤ 15; IQR (CV): 1 (1.4); 12 distinct values
      Missing: 62 (0.1%)

29  OTHER_UNITS_ASSIGNED_QUANTITY [integer]
      Mean (sd): 0.3 (0.8); min ≤ med ≤ max: 0 ≤ 0 ≤ 32; IQR (CV): 0 (2.8); 23 distinct values
      Missing: 62 (0.1%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-14

Next, we rename all the columns to shorter names so that they fit better in plots.

fire_data <- fire_data %>%
            rename(id = STARFIRE_INCIDENT_ID, datetime = INCIDENT_DATETIME, al_borough = ALARM_BOX_BOROUGH,
                   al_number = ALARM_BOX_NUMBER,al_location = ALARM_BOX_LOCATION, inc_borough = INCIDENT_BOROUGH,
                   zipcode = ZIPCODE, pol_prec = POLICEPRECINCT, city_con_dist = CITYCOUNCILDISTRICT,
                   commu_dist = COMMUNITYDISTRICT, commu_sc_dist = COMMUNITYSCHOOLDISTRICT,
                   cong_dist = CONGRESSIONALDISTRICT, al_source_desc = ALARM_SOURCE_DESCRIPTION_TX,
                   al_index_desc = ALARM_LEVEL_INDEX_DESCRIPTION, highest_al_level = HIGHEST_ALARM_LEVEL,
                   inc_class = INCIDENT_CLASSIFICATION, inc_class_group = INCIDENT_CLASSIFICATION_GROUP,
                   first_ass_datetime = FIRST_ASSIGNMENT_DATETIME, first_act_datetime = FIRST_ACTIVATION_DATETIME,
                   first_onscene_datetime = FIRST_ON_SCENE_DATETIME, inc_close_datetime = INCIDENT_CLOSE_DATETIME, 
                   
                   disp_resp_sec_qy = DISPATCH_RESPONSE_SECONDS_QY, disp_resp_sec_indc = VALID_DISPATCH_RSPNS_TIME_INDC,
                   inc_resp_sec_qy = INCIDENT_RESPONSE_SECONDS_QY, inc_resp_sec_indc = VALID_INCIDENT_RSPNS_TIME_INDC,
                   
                   inc_travel_sec_qy = INCIDENT_TRAVEL_TM_SECONDS_QY, 
                   
                   engines_assigned = ENGINES_ASSIGNED_QUANTITY,
                   ladders_assigned = LADDERS_ASSIGNED_QUANTITY, others_units_assigned = OTHER_UNITS_ASSIGNED_QUANTITY)

As the summary shows, there are many NA values, and many predictors are stored as characters rather than factors. In this step we convert the character predictors to factors and merge their rarest values, so that we do not end up with many low-frequency levels.

In addition, we add the predictor day_number, a factor day_type indicating whether the incident occurred on a weekday or during the weekend, and a factor time_of_day indicating the part of the day in which the incident occurred: Night (hours 0 to 6), Morning (hours 7 to 12), Afternoon (hours 13 to 18) and Evening (hours 19 to 23). Because the bins are closed on the right, hours 6, 12 and 18 fall in the earlier bin.

Since we are dealing with datetimes, we also check whether the recorded differences (inc_resp_sec_qy, inc_travel_sec_qy and disp_resp_sec_qy) are consistent with the timestamps; if they are not, we replace them with the recomputed values.
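
A caveat on this check: identical() is an exact comparison, so a one-second discrepancy or a floating-point rounding difference is enough to trigger the replacement. The sketch below uses the first four rows printed by head(fire_data) above to compare the recorded response time with the one recomputed from the timestamps. The variable names and the one-second tolerance are illustrative assumptions.

```r
# Values copied from the first four rows of head(fire_data)
incident_time <- as.POSIXct(c("2023-09-05 14:19:04", "2023-09-05 14:19:36",
                              "2023-09-05 14:19:43", "2023-09-05 14:21:00"),
                            tz = "UTC")
on_scene_time <- as.POSIXct(c("2023-09-05 14:25:23", "2023-09-05 14:23:21",
                              "2023-09-05 14:26:22", NA),
                            tz = "UTC")
recorded_sec <- c(378, 224, 398, NA)

# Recompute the response time from the timestamps, in seconds
recomputed_sec <- as.numeric(difftime(on_scene_time, incident_time, units = "secs"))
diff_sec <- recomputed_sec - recorded_sec

# Assumed tolerance: differences of up to one second count as agreement
tolerance_sec <- 1
data.frame(recorded_sec, recomputed_sec, diff_sec,
           within_tolerance = abs(diff_sec) <= tolerance_sec)
```

In these rows the recomputed value is one second larger than the recorded one, so an exact comparison reports a mismatch even though the two quantities agree for practical purposes.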

Finally, we add a further time difference, emergency_min_qy, which is the difference between inc_close_datetime and first_onscene_datetime.

# convert character predictors to factors
fire_data$inc_borough <- as.factor(fire_data$inc_borough)
fire_data$al_borough <- as.factor(fire_data$al_borough)
fire_data$al_source_desc <- as.factor(fire_data$al_source_desc)
fire_data$al_index_desc <- as.factor(fire_data$al_index_desc)
fire_data$highest_al_level <- as.factor(fire_data$highest_al_level)

# validity indicators: set both levels explicitly, since disp_resp_sec_indc only contains "N"
fire_data$disp_resp_sec_indc <- factor(fire_data$disp_resp_sec_indc, levels = c("N", "Y"))

fire_data$inc_resp_sec_indc <- factor(fire_data$inc_resp_sec_indc, levels = c("N", "Y"))

fire_data$inc_class_group <- as.factor(fire_data$inc_class_group)
fire_data$inc_class <- as.factor(fire_data$inc_class)

Moreover, the maximum values of the time differences are very large when expressed in seconds, so we convert the three quantities to minutes.

summary(fire_data %>% select(inc_resp_sec_qy, inc_travel_sec_qy, disp_resp_sec_qy))
##  inc_resp_sec_qy  inc_travel_sec_qy disp_resp_sec_qy 
##  Min.   :  18.0   Min.   :   0.0    Min.   :   2.00  
##  1st Qu.: 265.0   1st Qu.: 233.0    1st Qu.:   7.00  
##  Median : 334.0   Median : 301.0    Median :  19.00  
##  Mean   : 380.7   Mean   : 340.5    Mean   :  39.96  
##  3rd Qu.: 426.0   3rd Qu.: 392.0    3rd Qu.:  40.00  
##  Max.   :7130.0   Max.   :7122.0    Max.   :9023.00  
##  NA's   :14112    NA's   :14112
# convert seconds to minutes
fire_data$inc_resp_sec_qy <- fire_data$inc_resp_sec_qy / 60
fire_data$inc_travel_sec_qy <- fire_data$inc_travel_sec_qy / 60
fire_data$disp_resp_sec_qy <- fire_data$disp_resp_sec_qy / 60 

# rename the quantity and indicator predictors to reflect the new unit
fire_data <- fire_data %>% rename(inc_resp_min_qy = inc_resp_sec_qy, inc_travel_min_qy = inc_travel_sec_qy, disp_resp_min_qy = disp_resp_sec_qy, # quantity
                                  inc_resp_min_indc = inc_resp_sec_indc, disp_resp_min_indc = disp_resp_sec_indc) # indicator

Here we parse the datetime columns and create emergency_min_qy, day_type, ticket_time and time_of_day.

# Process datetime column
fire_data$datetime <- mdy_hms(fire_data$datetime)
fire_data$first_ass_datetime <- mdy_hms(fire_data$first_ass_datetime)
fire_data$first_act_datetime <- mdy_hms(fire_data$first_act_datetime)
fire_data$first_onscene_datetime <- mdy_hms(fire_data$first_onscene_datetime)
fire_data$inc_close_datetime <- mdy_hms(fire_data$inc_close_datetime)


# check whether the recorded differences match the timestamps; if not, replace them with the recomputed values

if (!identical(fire_data$inc_resp_min_qy, as.numeric(difftime(fire_data$first_onscene_datetime, fire_data$datetime, units="mins")))){
  fire_data$inc_resp_min_qy <- as.numeric(difftime(fire_data$first_onscene_datetime, fire_data$datetime, units="mins"))
}

if (!identical(fire_data$inc_travel_min_qy, as.numeric(difftime(fire_data$first_onscene_datetime, fire_data$first_ass_datetime, units="mins")))){
  fire_data$inc_travel_min_qy <- as.numeric(difftime(fire_data$first_onscene_datetime, fire_data$first_ass_datetime, units="mins"))
}

if (!identical(fire_data$disp_resp_min_qy, as.numeric(difftime(fire_data$first_ass_datetime, fire_data$datetime, units="mins")))){
  fire_data$disp_resp_min_qy <- as.numeric(difftime(fire_data$first_ass_datetime, fire_data$datetime, units="mins"))
}

# creating emergency_min_qy, the time taken by the firefighters to close the emergency after arriving on scene
fire_data$emergency_min_qy <- as.numeric(difftime(fire_data$inc_close_datetime, fire_data$first_onscene_datetime, units="mins"))

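# creating day_type; wday() with week_start = 1 returns 6 and 7 for Saturday and Sunday,
# so the result does not depend on the language of the system locale
fire_data$day_type <- as.factor(ifelse(wday(fire_data$datetime, week_start = 1) >= 6, "Weekend", "Weekday"))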

# creating ticket_time, the time in minutes from incident creation to incident close
fire_data$ticket_time <- as.numeric(difftime(fire_data$inc_close_datetime, fire_data$datetime, units="mins"))
  
# creating time_of_day
fire_data$time_of_day <- cut(
    hour(fire_data$datetime),
    breaks = c(0, 6, 12, 18, 24),
    labels = c("Night", "Morning", "Afternoon", "Evening"),
    include.lowest = TRUE,
    right = TRUE
)
  
fire_data$datetime <- NULL
table(fire_data$time_of_day)
## 
##     Night   Morning Afternoon   Evening 
##      8521     13270     16499     11710
ggplot(data=fire_data %>% 
          group_by(time_of_day) %>%
          summarise(incident_number = n()), 
        aes(x=time_of_day, y=incident_number)) + 
      geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=1.6, color="white", position = position_dodge(0.9), size=3.5) +
      labs(title = "Time of the Day - Incident Count", x = "Time of the Day", y = "Incident Count")

From this we can see that the highest number of incidents is registered in the Afternoon bin (hours 13 to 18), whereas the lowest is registered in the Night bin (hours 0 to 6).
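
The assignment of boundary hours depends on the right argument of cut(). The sketch below applies the same breaks and labels as above to a handful of hours and compares right-closed bins with left-closed bins. The left-closed variant is shown only for comparison and is not what the analysis uses.

```r
hours <- c(0, 6, 7, 12, 13, 18, 19, 23)
labels <- c("Night", "Morning", "Afternoon", "Evening")
breaks <- c(0, 6, 12, 18, 24)

# Same settings as in the analysis: intervals are [0,6], (6,12], (12,18], (18,24]
right_closed <- cut(hours, breaks = breaks, labels = labels,
                    include.lowest = TRUE, right = TRUE)

# Alternative: left-closed intervals [0,6), [6,12), [12,18), [18,24]
left_closed <- cut(hours, breaks = breaks, labels = labels,
                   include.lowest = TRUE, right = FALSE)

data.frame(hour = hours, right_closed, left_closed)
```

With right-closed bins, hour 6 is labelled Night and Night covers seven hours; with left-closed bins, hour 6 is labelled Morning and every bin covers six hours.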

day_type_table <- table(fire_data$day_type)
day_type_table[1] <- day_type_table[1] / 5
day_type_table[2] <- day_type_table[2] / 2
day_type_table
## 
## Weekday Weekend 
##  7296.4  6759.0

After dividing the counts by the number of days of each type in a week (5 and 2), weekdays show a somewhat higher average number of incidents than weekend days. Since the sample runs from 5 to 30 September, which is not a whole number of weeks, this normalization is only approximate.

Regarding the assigned units, we add a summary predictor equal to the sum of the three assigned-units predictors.
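
A possible refinement is to divide by the number of distinct calendar dates of each type that actually occur in the data, rather than by 5 and 2. The sketch below illustrates this on a made-up vector of incident dates. The name incident_dates is hypothetical, and in the analysis the dates would have to be extracted before the datetime column is dropped.

```r
# Made-up incident dates: a Friday, a Saturday, a Sunday and a Monday
incident_dates <- as.Date(c("2023-09-08", "2023-09-08", "2023-09-09",
                            "2023-09-10", "2023-09-11", "2023-09-11",
                            "2023-09-11"))

# POSIXlt weekday codes: 0 = Sunday, ..., 6 = Saturday (locale-independent)
weekday_code <- as.POSIXlt(incident_dates)$wday
day_type <- factor(ifelse(weekday_code %in% c(0, 6), "Weekend", "Weekday"),
                   levels = c("Weekday", "Weekend"))

# Incidents per day type, and distinct calendar dates per day type
n_incidents <- table(day_type)
n_days <- tapply(incident_dates, day_type, function(d) length(unique(d)))

# Average number of incidents per calendar day of each type
per_day <- as.numeric(n_incidents) / as.numeric(n_days)
names(per_day) <- levels(day_type)
per_day
```

This version stays correct when the sample starts or ends in the middle of a week, at the cost of needing the raw dates.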

fire_data$total_assigned_unit <- fire_data$engines_assigned + fire_data$ladders_assigned + fire_data$others_units_assigned

Rename the factor levels of the inc_borough and al_borough predictors.

fire_data <- fire_data %>% mutate(inc_borough = recode_factor(
                  inc_borough, "BRONX" = "Bronx", "BROOKLYN" = "Brooklyn", "MANHATTAN" = "Manhattan",
                  "QUEENS" = "Queens", "RICHMOND / STATEN ISLAND" = "Staten Island"),
                  
                  al_borough = recode_factor(
                  al_borough, "BRONX" = "Bronx", "BROOKLYN" = "Brooklyn", "MANHATTAN" = "Manhattan",
                  "QUEENS" = "Queens", "RICHMOND / STATEN ISLAND" = "Staten Island"))

At this point we merge some levels of the factor predictors to reduce the number of possible values.

Here we merge the following levels of highest_al_level: Second Alarm and Third Alarm into 2nd-3rd Alarm.

# highest_al_level
fire_data$highest_alarm_lev_new <- fire_data$highest_al_level
levels(fire_data$highest_alarm_lev_new) <- list(
  "All Hands Working" = "All Hands Working",
  "First Alarm" = "First Alarm", 
  "2nd-3rd Alarm" = c("Second Alarm", "Third Alarm")
)

print(ctable(fire_data$highest_al_level, fire_data$highest_alarm_lev_new, prop = 'n', totals = FALSE, headings = FALSE), method = 'render')
                     highest_alarm_lev_new
highest_al_level     All Hands Working   First Alarm   2nd-3rd Alarm
All Hands Working                  100             0               0
First Alarm                          0         49891               0
Second Alarm                         0             0               8
Third Alarm                          0             0               1


fire_data$highest_al_level <- fire_data$highest_alarm_lev_new
fire_data$highest_alarm_lev_new <- NULL

Here we merge the following levels of al_index_desc: Second Alarm, Third Alarm, 7-5 (All Hands Alarm), 10-76 & 10-77 Signal (Notification Hi-Rise Fire) and 10-75 Signal (Request for all hands alarm) into Others.

# al_index_desc
fire_data$alarm_level_idx_new <- fire_data$al_index_desc
levels(fire_data$alarm_level_idx_new) <- list(
  "DEFAULT RECORD" = "DEFAULT RECORD",
  "Initial Alarm" = "Initial Alarm", 
  "Others" = c("Second Alarm", "Third Alarm", "7-5 (All Hands Alarm)", 
               "10-76 & 10-77 Signal (Notification Hi-Rise Fire)",
               "10-75 Signal (Request for all hands alarm)")
)

print(ctable(fire_data$al_index_desc, fire_data$alarm_level_idx_new, prop = 'n', totals = FALSE, headings = FALSE), method = 'render')
                                                   alarm_level_idx_new
al_index_desc                                      DEFAULT RECORD   Initial Alarm   Others
10-75 Signal (Request for all hands alarm)                      0               0       13
10-76 & 10-77 Signal (Notification Hi-Rise Fire)                0               0        3
7-5 (All Hands Alarm)                                           0               0      100
DEFAULT RECORD                                              17313               0        0
Initial Alarm                                                   0           32562        0
Second Alarm                                                    0               0        8
Third Alarm                                                     0               0        1


fire_data$al_index_desc <- fire_data$alarm_level_idx_new
fire_data$alarm_level_idx_new <- NULL

Here we merge the following levels of al_source_desc: 911, 911TEXT, VERBAL, BARS, ERS, ERS-NC and SOL into Others.

fire_data$alarm_source_desc_new <- fire_data$al_source_desc
levels(fire_data$alarm_source_desc_new) <- list(
  "PHONE" = "PHONE",
  "EMS" = "EMS",
  "EMS-911" = "EMS-911",
  "CLASS-3" = "CLASS-3",
  "Others" = c("911", "911TEXT", "VERBAL", "BARS", "ERS", "ERS-NC", "SOL")
)

print(ctable(fire_data$al_source_desc, fire_data$alarm_source_desc_new, prop = 'n', totals = FALSE, headings = FALSE), method = 'render')
                 alarm_source_desc_new
al_source_desc   PHONE     EMS   EMS-911   CLASS-3   Others
911                  0       0         0         0      302
911TEXT              0       0         0         0       14
BARS                 0       0         0         0        1
CLASS-3              0       0         0      5025        0
EMS                  0   17178         0         0        0
EMS-911              0       0     10520         0        0
ERS                  0       0         0         0      777
ERS-NC               0       0         0         0        1
PHONE            15146       0         0         0        0
SOL                  0       0         0         0        5
VERBAL               0       0         0         0     1031


fire_data$al_source_desc <- fire_data$alarm_source_desc_new
fire_data$alarm_source_desc_new <- NULL

Let's view the dataset summary again to see the applied changes.

print(dfSummary(fire_data, 
                plain.ascii  = FALSE, 
                style        = "multiline", 
                headings     = FALSE,
                graph.magnif = 0.8, 
                valid.col    = FALSE),
                method = 'render')
No  Variable                Stats / Values                Freqs (% of Valid)     Missing
1   id                      1. 230905-B0042-001-1051      1 (0.0%)               0 (0.0%)
    [character]             2. 230905-B0053-001-0760      1 (0.0%)
                            3. 230905-B0053-002-0910      1 (0.0%)
                            4. 230905-B0081-001-1137      1 (0.0%)
                            5. 230905-B0106-002-0632      1 (0.0%)
                            6. 230905-B0132-001-0713      1 (0.0%)
                            7. 230905-B0147-001-0967      1 (0.0%)
                            8. 230905-B0160-001-1125      1 (0.0%)
                            9. 230905-B0163-001-1026      1 (0.0%)
                            10. 230905-B0165-001-0778     1 (0.0%)
                            [ 49990 others ]              49990 (100.0%)
2   al_borough              1. Bronx                      10973 (21.9%)          0 (0.0%)
    [factor]                2. Brooklyn                   13980 (28.0%)
                            3. Manhattan                  12890 (25.8%)
                            4. Queens                     9879 (19.8%)
                            5. Staten Island              2278 (4.6%)
3   al_number               Mean (sd) : 2930.3 (2446.5)   7411 distinct values   0 (0.0%)
    [integer]               min ≤ med ≤ max:
                            10 ≤ 2275 ≤ 9933
                            IQR (CV) : 2772 (0.8)
4   al_location             1. 8 AVE & W 155 ST           85 (0.2%)              0 (0.0%)
    [character]             2. 10 RICHMAN PLZ/SEDGWICK A  75 (0.1%)
                            3. AMSTERDAM AVE & LA SALLE   50 (0.1%)
                            4. 3 AVE & E 143 ST           48 (0.1%)
                            5. WASHINGTON AVE & E 170 ST  48 (0.1%)
                            6. FDR DR & E 6 ST            45 (0.1%)
                            7. CONCOURSE VILLAGE E & E 1  44 (0.1%)
                            8. PARK AVE & E 158 ST        40 (0.1%)
                            9. UNION TPK & WINCHESTER BL  40 (0.1%)
                            10. 8 AVE & W 33 ST           39 (0.1%)
                            [ 12203 others ]              49486 (99.0%)
5   inc_borough             1. Bronx                      10973 (21.9%)          0 (0.0%)
    [factor]                2. Brooklyn                   13980 (28.0%)
                            3. Manhattan                  12890 (25.8%)
                            4. Queens                     9879 (19.8%)
                            5. Staten Island              2278 (4.6%)
6   zipcode                 Mean (sd) : 10737.9 (551.8)   217 distinct values    3181 (6.4%)
    [integer]               min ≤ med ≤ max:
                            10000 ≤ 10472 ≤ 11697
                            IQR (CV) : 1098 (0.1)
7   pol_prec                Mean (sd) : 62.3 (34.8)       77 distinct values     3180 (6.4%)
    [integer]               min ≤ med ≤ max:
                            1 ≤ 61 ≤ 123
                            IQR (CV) : 56 (0.6)
8   city_con_dist           Mean (sd) : 23.1 (15.1)       51 distinct values     3180 (6.4%)
    [integer]               min ≤ med ≤ max:
                            1 ≤ 21 ≤ 51
                            IQR (CV) : 27 (0.7)
9   commu_dist              Mean (sd) : 262.9 (119.4)     70 distinct values     3180 (6.4%)
    [integer]               min ≤ med ≤ max:
                            101 ≤ 302 ≤ 595
                            IQR (CV) : 206 (0.5)
10  commu_sc_dist           Mean (sd) : 14.8 (9.7)        32 distinct values     3182 (6.4%)
    [integer]               min ≤ med ≤ max:
                            1 ≤ 13 ≤ 32
                            IQR (CV) : 18 (0.7)
11  cong_dist               Mean (sd) : 10.4 (3.3)        13 distinct values     3180 (6.4%)
    [integer]               min ≤ med ≤ max:
                            3 ≤ 11 ≤ 16
                            IQR (CV) : 5 (0.3)
12  al_source_desc          1. PHONE                      15146 (30.3%)          0 (0.0%)
    [factor]                2. EMS                        17178 (34.4%)
                            3. EMS-911                    10520 (21.0%)
                            4. CLASS-3                    5025 (10.1%)
                            5. Others                     2131 (4.3%)
13  al_index_desc           1. DEFAULT RECORD             17313 (34.6%)          0 (0.0%)
    [factor]                2. Initial Alarm              32562 (65.1%)
                            3. Others                     125 (0.2%)
14  highest_al_level        1. All Hands Working          100 (0.2%)             0 (0.0%)
    [factor]                2. First Alarm                49891 (99.8%)
                            3. 2nd-3rd Alarm              9 (0.0%)
15  inc_class               1. Abandoned Derelict Vehicl  7 (0.0%)               0 (0.0%)
    [factor]                2. Alarm System - Defective   387 (0.8%)
                            3. Alarm System - Testing     728 (1.5%)
                            4. Alarm System - Unnecessar  2845 (5.7%)
                            5. Assist Civilian - Non-Med  4140 (8.3%)
                            6. Automobile Fire            106 (0.2%)
                            7. Brush Fire                 27 (0.1%)
                            8. Carbon Monoxide - Code 1   813 (1.6%)
                            9. Carbon Monoxide - Code 2   133 (0.3%)
                            10. Carbon Monoxide - Code 3  92 (0.2%)
                            [ 57 others ]                 40722 (81.4%)
16  inc_class_group         1. Medical Emergencies        26824 (53.6%)          0 (0.0%)
    [factor]                2. Medical MFAs               208 (0.4%)
                            3. NonMedical Emergencies     19072 (38.1%)
                            4. NonMedical MFAs            1680 (3.4%)
                            5. NonStructural Fires        703 (1.4%)
                            6. Structural Fires           1513 (3.0%)
17  disp_resp_min_qy        Mean (sd) : 0.7 (2.2)         850 distinct values    0 (0.0%)
    [numeric]               min ≤ med ≤ max:
                            0 ≤ 0.3 ≤ 150.4
                            IQR (CV) : 0.6 (3.3)
18  first_ass_datetime      min : 2023-09-05 14:19:12     49509 distinct values  0 (0.0%)
    [POSIXct, POSIXt]       med : 2023-09-18 08:10:18.5
                            max : 2023-10-01 00:05:02
                            range : 25d 9H 45M 50S
19  first_act_datetime      min : 2023-09-05 14:19:26     49205 distinct values  139 (0.3%)
    [POSIXct, POSIXt]       med : 2023-09-18 08:09:01
                            max : 2023-10-01 00:05:16
                            range : 25d 9H 45M 50S
20  first_onscene_datetime  min : 2023-09-05 14:23:21     35552 distinct values  14112 (28.2%)
    [POSIXct, POSIXt]       med : 2023-09-18 11:08:09.5
                            max : 2023-10-01 00:09:41
                            range : 25d 9H 46M 20S
21  inc_close_datetime      min : 2023-09-05 14:25:05     49409 distinct values  0 (0.0%)
    [POSIXct, POSIXt]       med : 2023-09-18 08:35:24
                            max : 2023-10-01 00:58:42
                            range : 25d 10H 33M 37S
22  disp_resp_min_indc      1. N                          50000 (100.0%)         0 (0.0%)
    [factor]                2. Y                          0 (0.0%)
23  inc_resp_min_indc       1. N                          17036 (34.1%)          0 (0.0%)
    [factor]                2. Y                          32964 (65.9%)
24  inc_resp_min_qy         Mean (sd) : 6.4 (3.9)         1497 distinct values   14112 (28.2%)
    [numeric]               min ≤ med ≤ max:
                            0.3 ≤ 5.6 ≤ 118.8
                            IQR (CV) : 2.7 (0.6)
25  inc_travel_min_qy       Mean (sd) : 5.7 (3.5)         1369 distinct values   14112 (28.2%)
    [numeric]               min ≤ med ≤ max:
                            0 ≤ 5 ≤ 118.7
                            IQR (CV) : 2.6 (0.6)
26  engines_assigned        Mean (sd) : 1.1 (0.8)         15 distinct values     62 (0.1%)
    [integer]               min ≤ med ≤ max:
                            0 ≤ 1 ≤ 19
                            IQR (CV) : 0 (0.7)
27  ladders_assigned        Mean (sd) : 0.6 (0.8)         12 distinct values     62 (0.1%)
    [integer]               min ≤ med ≤ max:
                            0 ≤ 0 ≤ 15
                            IQR (CV) : 1 (1.4)
28  others_units_assigned   Mean (sd) : 0.3 (0.8)         23 distinct values     62 (0.1%)
    [integer]               min ≤ med ≤ max:
                            0 ≤ 0 ≤ 32
                            IQR (CV) : 0 (2.8)
29  emergency_min_qy        Mean (sd) : 17.8 (27.7)       4429 distinct values   14112 (28.2%)
    [numeric]               min ≤ med ≤ max:
                            0 ≤ 12 ≤ 2615
                            IQR (CV) : 13.6 (1.6)
30  day_type                1. Weekday                    36482 (73.0%)          0 (0.0%)
    [factor]                2. Weekend                    13518 (27.0%)
31  ticket_time             Mean (sd) : 19.5 (26.1)       4870 distinct values   0 (0.0%)
    [numeric]               min ≤ med ≤ max:
                            0.3 ≤ 14.6 ≤ 2625
                            IQR (CV) : 15.4 (1.3)
32  time_of_day             1. Night                      8521 (17.0%)           0 (0.0%)
    [factor]                2. Morning                    13270 (26.5%)
                            3. Afternoon                  16499 (33.0%)
                            4. Evening                    11710 (23.4%)
33  total_assigned_unit     Mean (sd) : 2 (2)             35 distinct values     62 (0.1%)
    [integer]               min ≤ med ≤ max:
                            1 ≤ 1 ≤ 66
                            IQR (CV) : 1 (1)


3.1 Dealing with invalid values

The next step is to deal with invalid values and to drop predictors that carry no useful information.

First, we suspected that al_borough and inc_borough might hold the same information, so let's check.

identical(fire_data$al_borough, fire_data$inc_borough)
## [1] TRUE

The columns al_borough and inc_borough contain the same sequence of values, so we can drop one of the two.

fire_data <- fire_data %>% select(-c(al_borough))
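
The same `identical()` test can be applied to every pair of columns to look for other redundant predictors. This is a minimal base-R sketch on a made-up data frame (`toy` and its values are assumptions for illustration). Note that `identical()` on factors also compares the level sets, so two factors with the same labels but different levels would not match.

```r
# Toy data frame; values are illustrative only
toy <- data.frame(
  al_borough  = factor(c("Bronx", "Queens", "Bronx")),
  inc_borough = factor(c("Bronx", "Queens", "Bronx")),
  zipcode     = c(10451L, 11101L, 10452L)
)

# All unordered pairs of column names
col_pairs <- combn(names(toy), 2, simplify = FALSE)

# Keep only the pairs whose columns are exactly identical
same_pairs <- Filter(
  function(p) identical(toy[[p[1]]], toy[[p[2]]]),
  col_pairs
)

for (p in same_pairs) cat(p[1], "duplicates", p[2], "\n")
```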

Next, we noticed that every observation in the dataset has disp_resp_min_indc equal to N. Let's check this again; if it is confirmed, we could delete both the indicator and the related quantity column.

summary(fire_data$disp_resp_min_indc)
##     N     Y 
## 50000     0

Every observation has an invalid disp_resp_min_indc, so we could drop both the indicator and the corresponding quantity disp_resp_min_qy. However, the indicator is also set to N for every observation in the original dataset, which is odd and looks like a data-acquisition problem. We therefore keep the time difference disp_resp_min_qy and drop only the constant indicator.

fire_data <- fire_data %>% select(-c(disp_resp_min_indc))
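
A column that takes a single value, like disp_resp_min_indc here, cannot help any model. As a hedged sketch, such columns can be detected automatically by counting distinct non-missing values per column; the data frame `toy` below is made up for illustration.

```r
# Toy data frame; values are illustrative only
toy <- data.frame(
  disp_resp_min_indc = factor(c("N", "N", "N"), levels = c("N", "Y")),
  inc_resp_min_indc  = factor(c("N", "Y", "Y"), levels = c("N", "Y")),
  disp_resp_min_qy   = c(0.2, 0.5, 0.3)
)

# TRUE for columns with at most one distinct non-missing value
is_constant <- vapply(
  toy,
  function(col) length(unique(col[!is.na(col)])) <= 1L,
  logical(1)
)

constant_cols <- names(toy)[is_constant]
print(constant_cols)
```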

Now we run a quick check on the other indicator variable, inc_resp_min_indc.

summary(fire_data$inc_resp_min_indc)
##     N     Y 
## 17036 32964

Here some observations do have a valid inc_resp_min_indc. We will keep only the valid ones and drop those flagged as invalid.

Before doing that, however, let's look at how valid and invalid response times are distributed across the boroughs.

ggplot(data=fire_data %>% group_by(inc_borough, inc_resp_min_indc) %>% summarise(incident_number = n()), 
       aes(x=inc_borough, y=incident_number, fill=inc_resp_min_indc)) +
  geom_bar(stat="identity", position=position_dodge()) +
  geom_text(aes(label=incident_number), vjust=1.6, color="white",
            position = position_dodge(0.9), size=3.5) +
  scale_fill_brewer(palette="Paired") +
  labs(title = "Incident Count - Borough - Valid Response Time in Minutes", x = "Borough", y = "Incident Number", fill = "Valid Response\n Time in Minutes")
## `summarise()` has grouped output by 'inc_borough'. You can override using the
## `.groups` argument.

We can see that incidents with a valid response time are more numerous than those with an invalid one, but it is more interesting to look at the ratio between valid and invalid incidents.

The share of valid and invalid inc_resp_min_indc within each borough is:

rateo_inc_resp_min_indc <- fire_data %>% 
  group_by(inc_borough, inc_resp_min_indc) %>% 
  summarise(incident_number = n()) %>% 
  mutate(ratio=incident_number/sum(incident_number))
## `summarise()` has grouped output by 'inc_borough'. You can override using the
## `.groups` argument.
ggplot(rateo_inc_resp_min_indc, aes(fill=inc_resp_min_indc, y=ratio, x=inc_borough)) + 
  geom_bar(position="fill", stat="identity") + 
  geom_text(aes(label=scales::percent(ratio)), position=position_fill(vjust=0.5)) +
  labs(title="Borough - Ratio between Valid and Invalid Incidents",
       x="Borough",
       y="Ratio between Valid and Invalid Incidents",
       fill="Valid Response\nTime in Minutes")

We can see that Staten Island has the highest share of incidents with a valid inc_resp_min_indc, whereas Manhattan has the lowest. Keep in mind, though, that Staten Island has by far the fewest incidents overall (2278), while Manhattan has one of the largest counts (12890, second only to Brooklyn with 13980).
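
The ratio plotted above is the share of each indicator value within its borough, i.e. the count divided by the borough total. As a cross-check, the same quantity can be obtained in base R with `prop.table()` on a contingency table. The sketch below uses a small made-up data frame, not fire_data.

```r
# Toy data; values are illustrative only
toy <- data.frame(
  inc_borough       = c("Bronx", "Bronx", "Bronx", "Queens", "Queens"),
  inc_resp_min_indc = c("Y", "Y", "N", "Y", "N")
)

# Counts per borough (rows) and indicator value (columns)
counts <- table(toy$inc_borough, toy$inc_resp_min_indc)

# margin = 1 divides each row by its total, so every row sums to 1
shares <- prop.table(counts, margin = 1)
print(round(shares, 3))
```

Using `margin = 1` matters: without it, `prop.table()` divides by the grand total and the shares are no longer comparable across boroughs of different sizes.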

Now we run an additional analysis to see whether there is some kind of relation between inc_resp_min_indc and total_assigned_unit.

ggplot(fire_data, aes(total_assigned_unit, inc_resp_min_qy)) + 
  geom_point(aes(colour = inc_resp_min_indc))+
   labs(title = "Total Assigned Units - Response Time In Minutes", x = "Total Assigned Units", y = "Response Time In Minutes", colour = "Valid Response\n Time in Minutes")
## Warning: Removed 14116 rows containing missing values (`geom_point()`).

We note that most incidents that were assigned a single unit have a high response time, and their measurement is flagged as invalid. As the total number of units grows, the response time decreases and the measurements become valid.

ggplot(fire_data %>% filter(inc_resp_min_indc == "N")
            , aes(total_assigned_unit, inc_resp_min_qy)) + 
  geom_point(aes(colour = inc_class_group, shape = inc_class_group)) +
  labs(title = "Total Assigned Units - Response Time In Minutes - Incident Class Group", x = "Total Assigned Units", y = "Response Time In Minutes", colour = "Incident Class Groups")
## Warning: Removed 14112 rows containing missing values (`geom_point()`).

As discussed before, almost all the incidents with an invalid response time were assigned a single unit. Looking at the incident class group, we also find that most of them are Medical Emergencies, whereas almost all the remaining ones are NonMedical Emergencies.

# add an additional predictor
fire_data$tua_is_one <- as.factor(ifelse(fire_data$total_assigned_unit == 1, "Y", "N"))
  
tua_is_one <- fire_data %>% 
        filter(inc_resp_min_indc == "N", inc_class_group == "Medical Emergencies") %>%
        group_by(inc_borough, tua_is_one) %>%
        summarise(incident_number = n())
## `summarise()` has grouped output by 'inc_borough'. You can override using the
## `.groups` argument.
ggplot(data=tua_is_one, 
       aes(x=inc_borough, y=incident_number, fill=tua_is_one)) +
  geom_bar(stat="identity", position=position_dodge()) +
  geom_text(aes(label=incident_number), vjust=1.5, color="black",
            position = position_dodge(0.9), size=3.5) +
  scale_fill_brewer(palette="Set1") +
  labs(title = "Total Assigned Units One or Not", x = "Borough", y = "Incident Count", fill = "Total Assigned\nUnits are One")

We have also added a factor predictor, tua_is_one, which indicates whether the total number of assigned units is equal to one.
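
One detail of this construction is that `ifelse()` propagates missing values: an observation whose total_assigned_unit is NA gets an NA flag rather than "N". A minimal sketch on a made-up vector (`units` is an assumption for illustration):

```r
# Toy vector of assigned units, including one missing value
units <- c(1L, 3L, NA, 1L)

# Same construction used for tua_is_one
flag <- as.factor(ifelse(units == 1, "Y", "N"))

# The NA in `units` is carried over to the flag
print(table(flag, useNA = "ifany"))
```

This is consistent with the NA counts reported later in the document, where tua_is_one has as many missing values as total_assigned_unit.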

Next, we analyse the Incident Class of the incidents with an invalid response time that were assigned a single unit.

ggplot(data=fire_data %>% 
        filter(inc_resp_min_indc == "N", inc_class_group == "Medical Emergencies", tua_is_one == "Y") %>%
        group_by(inc_class, inc_borough) %>%
        summarise(incident_number = n()), 
       aes(x=inc_borough, y=incident_number, fill=inc_class)) + 
      geom_bar(stat="identity", position=position_dodge()) +
        geom_text(aes(label=incident_number), vjust=1.6, color="black",
                  position = position_dodge(0.9), size=3) +
        scale_fill_brewer(palette="Set1") +
        labs(title = "Borough - Incident Counts - Incident Class -- for Total Assigned Units equal to 1", x = "Borough", y = "Incident Counts", fill = "Incident Class")
## `summarise()` has grouped output by 'inc_class'. You can override using the
## `.groups` argument.

We found that most of the incidents meeting these conditions are classified as Medical - EMS Link 10-91 or Medical - PD Link 10-91.

Thanks to the 10code site, we found a description of the two emergency codes:

  1. 10-91 Medical Emergency EMS - Fire Unit Not Required - To be transmitted through borough dispatcher by the responding unit when the fire Unit is canceled enroute due to EMS on scene, or EMS downgrades the job to a segment that does not require a Fire Unit response. Note: This signal shall be used only for medical emergency incidents. EMS stands for Emergency Medical Services.

  2. 10-91 Medical Emergency PD - Fire Unit Not Required - To be transmitted through borough dispatcher by the responding unit when the fire Unit is canceled enroute due to PD on scene, or PD downgrades the job to a segment that does not require a Fire Unit response. Note: This signal shall be used only for medical emergency incidents. We assume that PD stands for Police Department.

Now we turn to the NonMedical Emergencies, starting with the distribution of their incident classes.

print(fire_data %>% 
        filter(inc_resp_min_indc == "N", inc_class_group == "NonMedical Emergencies") %>%
        group_by(inc_class) %>%
        summarise(incident_number = n()))
## # A tibble: 24 × 2
##    inc_class                                         incident_number
##    <fct>                                                       <int>
##  1 Alarm System - Defective                                       10
##  2 Alarm System - Testing                                         22
##  3 Alarm System - Unnecessary                                    110
##  4 Assist Civilian - Non-Medical                                 828
##  5 Carbon Monoxide - Code 1 - Investigation                       25
##  6 Carbon Monoxide - Code 2 - Incident (1-9 ppm)                   4
##  7 Carbon Monoxide - Code 3 - Emergency (over 9 ppm)               4
##  8 Defective Oil Burner                                            5
##  9 Downed Tree                                                    28
## 10 Elevator Emergency - Occupied                                 104
## # ℹ 14 more rows
ggplot(data=fire_data %>% 
          filter(inc_resp_min_indc == "N", inc_class_group == "NonMedical Emergencies", inc_class == "Assist Civilian - Non-Medical") %>%
          group_by(inc_borough) %>%
          summarise(incident_number = n()), 
        aes(x=inc_borough, y=incident_number)) + 
      geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=1.6, color="white", position = position_dodge(0.9), size=3.5) +
      labs(title = "Incident Count - Borough - Invalid Response Time - Assist Civilian - Non-Medical", x = "Borough", y = "Incident Count")

We found that most of the NonMedical Emergencies with an invalid inc_resp_min_indc belong to the incident class Assist Civilian - Non-Medical.

For the sake of consistency, we will keep only the valid observations, i.e. those with inc_resp_min_indc == "Y".

fire_data <- fire_data %>% filter(inc_resp_min_indc == "Y")
dim(fire_data)
## [1] 32964    32

Now we want to know which inc_class values are summarized in each inc_class_group, to make sure that each inc_class belongs to a single inc_class_group.

print(ctable(fire_data$inc_class, fire_data$inc_class_group, totals = FALSE, headings = FALSE), method = 'render')
inc_class (rows) by inc_class_group (columns); each cell shows the count and the row percentage
inc_class   Medical Emergencies   Medical MFAs   NonMedical Emergencies   NonMedical MFAs   NonStructural Fires   Structural Fires
Abandoned Derelict Vehicle Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 6 ( 100.0% ) 0 ( 0.0% )
Alarm System - Defective 0 ( 0.0% ) 0 ( 0.0% ) 377 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Alarm System - Testing 0 ( 0.0% ) 0 ( 0.0% ) 706 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Alarm System - Unnecessary 0 ( 0.0% ) 0 ( 0.0% ) 2735 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Assist Civilian - Non-Medical 0 ( 0.0% ) 0 ( 0.0% ) 3312 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Automobile Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 101 ( 100.0% ) 0 ( 0.0% )
Brush Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 24 ( 100.0% ) 0 ( 0.0% )
Carbon Monoxide - Code 1 - Investigation 0 ( 0.0% ) 0 ( 0.0% ) 788 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Carbon Monoxide - Code 2 - Incident (1-9 ppm) 0 ( 0.0% ) 0 ( 0.0% ) 129 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Carbon Monoxide - Code 3 - Emergency (over 9 ppm) 0 ( 0.0% ) 0 ( 0.0% ) 88 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Carbon Monoxide - Code 4 - No Detector Activation 0 ( 0.0% ) 0 ( 0.0% ) 8 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Church Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 10 ( 100.0% )
Defective Oil Burner 0 ( 0.0% ) 0 ( 0.0% ) 34 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Demolition Debris or Rubbish Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 272 ( 100.0% ) 0 ( 0.0% )
Downed Tree 0 ( 0.0% ) 0 ( 0.0% ) 280 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Elevator Emergency - Occupied 0 ( 0.0% ) 0 ( 0.0% ) 1850 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Elevator Emergency - Unoccupied 0 ( 0.0% ) 0 ( 0.0% ) 708 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Factory Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 1 ( 100.0% )
Hospital Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 18 ( 100.0% )
Manhole Fire - Blown Cover 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 9 ( 100.0% ) 0 ( 0.0% )
Manhole Fire - Other 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 55 ( 100.0% ) 0 ( 0.0% )
Manhole Fire - Seeping Smoke 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 104 ( 100.0% ) 0 ( 0.0% )
Maritime Emergency 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Maritime Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Medical - Assist Civilian 27 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Medical - Breathing / Ill or Sick 4779 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Medical - EMS Link 10-91 1096 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Medical - No PT Contact EMS is Onscene 4285 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Medical - PD Link 10-91 868 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Medical - Serious Life Threatening 366 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Medical - Victim Deceased 287 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Medical MFA - EMS Link 0 ( 0.0% ) 87 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Medical MFA - PD Link 0 ( 0.0% ) 77 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Multiple Dwelling 'A' - Compactor fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 4 ( 100.0% )
Multiple Dwelling 'A' - Food on the stove fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 519 ( 100.0% )
Multiple Dwelling 'A' - Other fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 168 ( 100.0% )
Multiple Dwelling 'B' Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 85 ( 100.0% )
Non-Medical 10-91 (Unnecessary Alarm) 0 ( 0.0% ) 0 ( 0.0% ) 102 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Non-Medical MFA - ERS 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 586 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Non-Medical MFA - ERS No Contact 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 1 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Non-Medical MFA - Phone 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 701 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Non-Medical MFA - Private Fire Alarm 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 223 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Non-Medical MFA - Verbal 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 7 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Odor - Other Smoke 0 ( 0.0% ) 0 ( 0.0% ) 166 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Odor - Other Than Smoke 0 ( 0.0% ) 0 ( 0.0% ) 1317 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Other Commercial Building Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 184 ( 100.0% )
Other Public Building Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 4 ( 100.0% )
Other Transportation Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 14 ( 100.0% ) 0 ( 0.0% )
Private Dwelling Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 412 ( 100.0% )
Remove Civilian - Non-Fire 0 ( 0.0% ) 0 ( 0.0% ) 27 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
School Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 31 ( 100.0% )
Sprinkler System - Activated 0 ( 0.0% ) 0 ( 0.0% ) 6 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Sprinkler System - Malfunction 0 ( 0.0% ) 0 ( 0.0% ) 41 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Sprinkler System - Working on System 0 ( 0.0% ) 0 ( 0.0% ) 28 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Store Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 9 ( 100.0% )
Transit System - NonStructural 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 59 ( 100.0% ) 0 ( 0.0% )
Transit System - Structural 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 1 ( 100.0% )
Transit System Emergency 0 ( 0.0% ) 0 ( 0.0% ) 18 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Undefined Emergency 0 ( 0.0% ) 0 ( 0.0% ) 71 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Under Contruction / Vacant Fire 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 1 ( 100.0% )
Utility Emergency - Electric 0 ( 0.0% ) 0 ( 0.0% ) 595 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Utility Emergency - Gas 0 ( 0.0% ) 0 ( 0.0% ) 1335 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Utility Emergency - Steam 0 ( 0.0% ) 0 ( 0.0% ) 137 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Utility Emergency - Undefined 0 ( 0.0% ) 0 ( 0.0% ) 4 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Utility Emergency - Water 0 ( 0.0% ) 0 ( 0.0% ) 1157 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Vehicle Accident - Other 0 ( 0.0% ) 0 ( 0.0% ) 1443 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Vehicle Accident - With Extrication 0 ( 0.0% ) 0 ( 0.0% ) 21 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )


As the table above shows, each inc_class falls into exactly one inc_class_group, so the groups have disjoint sets of classes.
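
Reading a table with this many rows by eye is error-prone, so the same property can be verified programmatically by counting how many distinct groups each class appears in. The sketch below uses a made-up data frame in which one class is deliberately assigned to two groups, to show what a violation would look like; it is an illustration, not the check run on fire_data.

```r
# Toy data; the last row is deliberately inconsistent
toy <- data.frame(
  inc_class       = c("Brush Fire", "Automobile Fire", "Church Fire",
                      "Brush Fire", "Church Fire"),
  inc_class_group = c("NonStructural Fires", "NonStructural Fires",
                      "Structural Fires", "NonStructural Fires",
                      "NonStructural Fires")
)

# Number of distinct groups in which each class appears
groups_per_class <- tapply(
  toy$inc_class_group, toy$inc_class,
  function(g) length(unique(g))
)

# Classes that map to more than one group
ambiguous <- names(groups_per_class)[groups_per_class > 1L]
print(ambiguous)
```

An empty `ambiguous` vector would confirm that every class belongs to a single group.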

To make this clearer, we display each main class together with its sub-classes.

for (variable in levels(fire_data$inc_class_group)) {
  non_zero_table <- table(subset(fire_data, inc_class_group == variable)$inc_class)
  cat(variable, "\n")
  print(non_zero_table[non_zero_table != 0])
  cat("\n")
}
## Medical Emergencies 
## 
##              Medical - Assist Civilian      Medical - Breathing / Ill or Sick 
##                                     27                                   4779 
##               Medical - EMS Link 10-91 Medical - No PT Contact EMS is Onscene 
##                                   1096                                   4285 
##                Medical - PD Link 10-91     Medical - Serious Life Threatening 
##                                    868                                    366 
##              Medical - Victim Deceased 
##                                    287 
## 
## Medical MFAs 
## 
## Medical MFA - EMS Link  Medical MFA - PD Link 
##                     87                     77 
## 
## NonMedical Emergencies 
## 
##                          Alarm System - Defective 
##                                               377 
##                            Alarm System - Testing 
##                                               706 
##                        Alarm System - Unnecessary 
##                                              2735 
##                     Assist Civilian - Non-Medical 
##                                              3312 
##          Carbon Monoxide - Code 1 - Investigation 
##                                               788 
##     Carbon Monoxide - Code 2 - Incident (1-9 ppm) 
##                                               129 
## Carbon Monoxide - Code 3 - Emergency (over 9 ppm) 
##                                                88 
## Carbon Monoxide - Code 4 - No Detector Activation 
##                                                 8 
##                              Defective Oil Burner 
##                                                34 
##                                       Downed Tree 
##                                               280 
##                     Elevator Emergency - Occupied 
##                                              1850 
##                   Elevator Emergency - Unoccupied 
##                                               708 
##             Non-Medical 10-91 (Unnecessary Alarm) 
##                                               102 
##                                Odor - Other Smoke 
##                                               166 
##                           Odor - Other Than Smoke 
##                                              1317 
##                        Remove Civilian - Non-Fire 
##                                                27 
##                      Sprinkler System - Activated 
##                                                 6 
##                    Sprinkler System - Malfunction 
##                                                41 
##              Sprinkler System - Working on System 
##                                                28 
##                          Transit System Emergency 
##                                                18 
##                               Undefined Emergency 
##                                                71 
##                      Utility Emergency - Electric 
##                                               595 
##                           Utility Emergency - Gas 
##                                              1335 
##                         Utility Emergency - Steam 
##                                               137 
##                     Utility Emergency - Undefined 
##                                                 4 
##                         Utility Emergency - Water 
##                                              1157 
##                          Vehicle Accident - Other 
##                                              1443 
##               Vehicle Accident - With Extrication 
##                                                21 
## 
## NonMedical MFAs 
## 
##                Non-Medical MFA - ERS     Non-Medical MFA - ERS No Contact 
##                                  586                                    1 
##              Non-Medical MFA - Phone Non-Medical MFA - Private Fire Alarm 
##                                  701                                  223 
##             Non-Medical MFA - Verbal 
##                                    7 
## 
## NonStructural Fires 
## 
##   Abandoned Derelict Vehicle Fire                   Automobile Fire 
##                                 6                               101 
##                        Brush Fire Demolition Debris or Rubbish Fire 
##                                24                               272 
##        Manhole Fire - Blown Cover              Manhole Fire - Other 
##                                 9                                55 
##      Manhole Fire - Seeping Smoke         Other Transportation Fire 
##                               104                                14 
##    Transit System - NonStructural 
##                                59 
## 
## Structural Fires 
## 
##                                    Church Fire 
##                                             10 
##                                   Factory Fire 
##                                              1 
##                                  Hospital Fire 
##                                             18 
##         Multiple Dwelling 'A' - Compactor fire 
##                                              4 
## Multiple Dwelling 'A' - Food on the stove fire 
##                                            519 
##             Multiple Dwelling 'A' - Other fire 
##                                            168 
##                     Multiple Dwelling 'B' Fire 
##                                             85 
##                 Other Commercial Building Fire 
##                                            184 
##                     Other Public Building Fire 
##                                              4 
##                          Private Dwelling Fire 
##                                            412 
##                                    School Fire 
##                                             31 
##                                     Store Fire 
##                                              9 
##                    Transit System - Structural 
##                                              1 
##                Under Contruction / Vacant Fire 
##                                              1

3.2 NA Patterns?

At this point it is essential to deal with the NA values and to look for possible relations between missingness and the predictors. First, let's recap the number of NA values currently present in each column.

colSums(is.na(fire_data))
##                     id              al_number            al_location 
##                      0                      0                      0 
##            inc_borough                zipcode               pol_prec 
##                      0                   2197                   2197 
##          city_con_dist             commu_dist          commu_sc_dist 
##                   2197                   2197                   2198 
##              cong_dist         al_source_desc          al_index_desc 
##                   2197                      0                      0 
##       highest_al_level              inc_class        inc_class_group 
##                      0                      0                      0 
##       disp_resp_min_qy     first_ass_datetime     first_act_datetime 
##                      0                      0                     41 
## first_onscene_datetime     inc_close_datetime      inc_resp_min_indc 
##                      0                      0                      0 
##        inc_resp_min_qy      inc_travel_min_qy       engines_assigned 
##                      0                      0                      4 
##       ladders_assigned  others_units_assigned       emergency_min_qy 
##                      4                      4                      0 
##               day_type            ticket_time            time_of_day 
##                      0                      0                      0 
##    total_assigned_unit             tua_is_one 
##                      4                      4
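Raw counts are easier to compare once they are expressed as a share of the rows. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration, they are not taken from fire_data); it uses only base R.

```r
# Toy data frame standing in for fire_data (hypothetical values)
toy <- data.frame(
  zipcode  = c(10001, NA, 11201, NA),
  pol_prec = c(13, NA, 84, 7),
  borough  = c("Manhattan", "Queens", "Brooklyn", "Bronx")
)

# Share of missing values per column, in percent
na_share <- round(colMeans(is.na(toy)) * 100, 1)

# Keep only the columns that actually contain NA values, largest share first
na_share <- sort(na_share[na_share > 0], decreasing = TRUE)
na_share
```

Applied to the real data, the same two lines would show at a glance which columns share the same missingness rate.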

3.2.1 Checking the location predictors

Here we check whether there is a pattern in the missing values of the following predictors: zipcode, pol_prec, city_con_dist, commu_dist, commu_sc_dist and cong_dist.

na_locations <- fire_data %>%
  filter(is.na(zipcode) | is.na(pol_prec) | is.na(city_con_dist) | is.na(commu_dist) | is.na(commu_sc_dist) | is.na(cong_dist))
ggplot(data=na_locations %>% 
        group_by(inc_class_group, inc_borough) %>%
        summarise(incident_number = n()), 
       aes(x=inc_borough, y=incident_number, fill=inc_class_group)) + geom_bar(stat="identity", position=position_dodge()) +
        geom_text(aes(label=incident_number), vjust=1.6, color="black",
                  position = position_dodge(0.9), size=3.5) +
        #scale_fill_brewer(palette="Paired") +
        labs(title = "NA location", x = "Borough", y = "Incident Count", fill = "Incident Class Group")
## `summarise()` has grouped output by 'inc_class_group'. You can override using
## the `.groups` argument.

From the bar chart we note that most observations with at least one location predictor set to NA belong to the incident class groups NonMedical Emergencies, NonMedical MFAs and Medical Emergencies.

table(na_locations$inc_borough) / table(fire_data$inc_borough)
## 
##         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##    0.07104538    0.04694547    0.07662157    0.07700328    0.07523697
table(na_locations$inc_class_group) / table(fire_data$inc_class_group)
## 
##    Medical Emergencies           Medical MFAs NonMedical Emergencies 
##             0.03988726             0.10975610             0.05691243 
##        NonMedical MFAs    NonStructural Fires       Structural Fires 
##             0.40447958             0.14906832             0.00552868

Moreover, around 40% of all incidents in the NonMedical MFAs incident class group have at least one of the location columns set to NA. Let's investigate.

fd_nm_mfa_cl <- table(subset(fire_data, inc_class_group == "NonMedical MFAs")$inc_class)
fd_nm_mfa_bro <- table(subset(fire_data, inc_class_group == "NonMedical MFAs")$inc_borough)

fd_nm_mfa_cl <- fd_nm_mfa_cl[fd_nm_mfa_cl != 0]
fd_nm_mfa_cl
## 
##                Non-Medical MFA - ERS     Non-Medical MFA - ERS No Contact 
##                                  586                                    1 
##              Non-Medical MFA - Phone Non-Medical MFA - Private Fire Alarm 
##                                  701                                  223 
##             Non-Medical MFA - Verbal 
##                                    7

In the original dataset, this is the distribution of inc_class for the NonMedical MFAs.

na_nm_mfa_cl <- table(subset(na_locations, inc_class_group == "NonMedical MFAs")$inc_class)
na_nm_mfa_bro <- table(subset(na_locations, inc_class_group == "NonMedical MFAs")$inc_borough)

na_nm_mfa_cl <- na_nm_mfa_cl[names(fd_nm_mfa_cl)]
na_nm_mfa_cl
## 
##                Non-Medical MFA - ERS     Non-Medical MFA - ERS No Contact 
##                                  573                                    1 
##              Non-Medical MFA - Phone Non-Medical MFA - Private Fire Alarm 
##                                   38                                    2 
##             Non-Medical MFA - Verbal 
##                                    0
na_nm_mfa_cl / fd_nm_mfa_cl
## 
##                Non-Medical MFA - ERS     Non-Medical MFA - ERS No Contact 
##                           0.97781570                           1.00000000 
##              Non-Medical MFA - Phone Non-Medical MFA - Private Fire Alarm 
##                           0.05420827                           0.00896861 
##             Non-Medical MFA - Verbal 
##                           0.00000000

So about 98% of all the Non-Medical MFA - ERS observations in the entire dataset have at least one location attribute equal to NA.

na_nm_mfa_bro / fd_nm_mfa_bro
## 
##         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##     0.4676056     0.3075221     0.3894472     0.3733333     0.7954545

From here we can see that about 80% of the NonMedical MFAs observations from Richmond / Staten Island have at least one location column set to NA. The Bronx follows, with almost half (about 47%) of its NonMedical MFAs observations having at least one location column set to NA.
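The ratios above divide one frequency table by another. An equivalent way to get the share of missing values per group, sketched here on hypothetical toy data (the `toy` frame and its shortened class labels are assumptions for illustration), is to average a logical NA flag within each group.

```r
# Toy stand-in for the NonMedical MFAs subset (hypothetical values)
toy <- data.frame(
  inc_class = c("ERS", "ERS", "ERS", "Phone", "Phone", "Phone", "Phone"),
  zipcode   = c(NA, NA, 10001, 10002, NA, 10003, 10004)
)

# Flag the rows with a missing location value
toy$loc_missing <- is.na(toy$zipcode)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the NA share within each incident class
na_share_by_class <- tapply(toy$loc_missing, toy$inc_class, mean)
round(na_share_by_class, 3)
```

This avoids having to align the names of two separate tables before dividing them.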

3.2.2 Checking the assigned units predictors

print(fire_data %>%
  filter(is.na(engines_assigned) | is.na(ladders_assigned) | is.na(others_units_assigned)) %>%
  group_by(inc_borough, inc_class)) %>%
  summarise(incident_count = n())
## # A tibble: 4 × 32
## # Groups:   inc_borough, inc_class [4]
##   id            al_number al_location inc_borough zipcode pol_prec city_con_dist
##   <chr>             <int> <chr>       <fct>         <int>    <int>         <int>
## 1 230905-Q4545…      4545 53 AVE & 6… Queens        11378      104            30
## 2 230914-Q1014…      1014 CENTRAL AV… Queens        11691      101            31
## 3 230918-Q9643…      9643 JAMAICA AV… Queens        11418      102            29
## 4 230919-M0684…       684 1 AVE & E … Manhattan     10016       13             4
## # ℹ 25 more variables: commu_dist <int>, commu_sc_dist <int>, cong_dist <int>,
## #   al_source_desc <fct>, al_index_desc <fct>, highest_al_level <fct>,
## #   inc_class <fct>, inc_class_group <fct>, disp_resp_min_qy <dbl>,
## #   first_ass_datetime <dttm>, first_act_datetime <dttm>,
## #   first_onscene_datetime <dttm>, inc_close_datetime <dttm>,
## #   inc_resp_min_indc <fct>, inc_resp_min_qy <dbl>, inc_travel_min_qy <dbl>,
## #   engines_assigned <int>, ladders_assigned <int>, …
## `summarise()` has grouped output by 'inc_borough'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   inc_borough [2]
##   inc_borough inc_class                              incident_count
##   <fct>       <fct>                                           <int>
## 1 Manhattan   Vehicle Accident - Other                            1
## 2 Queens      Assist Civilian - Non-Medical                       1
## 3 Queens      Medical - No PT Contact EMS is Onscene              1
## 4 Queens      Medical - PD Link 10-91                             1

Since only four observations are affected, we can safely remove them.

3.2.3 Checking the first_act_datetime predictor

na_first_act_datetime <- fire_data %>% filter(is.na(first_act_datetime))
print(na_first_act_datetime %>% group_by(inc_class, inc_borough) %>% summarise(incident_count = n()))
## `summarise()` has grouped output by 'inc_class'. You can override using the
## `.groups` argument.
## # A tibble: 27 × 3
## # Groups:   inc_class [15]
##    inc_class                         inc_borough   incident_count
##    <fct>                             <fct>                  <int>
##  1 Alarm System - Unnecessary        Brooklyn                   2
##  2 Assist Civilian - Non-Medical     Bronx                      1
##  3 Assist Civilian - Non-Medical     Brooklyn                   4
##  4 Assist Civilian - Non-Medical     Queens                     1
##  5 Demolition Debris or Rubbish Fire Brooklyn                   1
##  6 Downed Tree                       Manhattan                  1
##  7 Downed Tree                       Queens                     1
##  8 Downed Tree                       Staten Island              1
##  9 Elevator Emergency - Occupied     Brooklyn                   1
## 10 Elevator Emergency - Occupied     Manhattan                  1
## # ℹ 17 more rows
ggplot(data=na_first_act_datetime %>% 
        group_by(inc_class_group, inc_borough) %>%
        summarise(incident_number = n()), 
       aes(x=inc_borough, y=incident_number, fill=inc_class_group)) + geom_bar(stat="identity", position=position_dodge()) +
labs(title = "NA First Act Date", x = "Borough", y = "Incident Count", fill = "Incident Class Group")
## `summarise()` has grouped output by 'inc_class_group'. You can override using
## the `.groups` argument.

The missing values appear to be random: no pattern seems to explain the presence of NA values in first_act_datetime.
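A visual check like this one can be complemented by a formal test of independence between the NA flag and a grouping factor. The sketch below is only an illustration on simulated data (the `toy` frame, the 2% missingness rate and the seed are assumptions, not values from the report). It uses `chisq.test()` with a simulated p-value, which is a common choice when some cells have small expected counts.

```r
set.seed(1)
n <- 5000

# Simulated data in which missingness is independent of the borough by construction
toy <- data.frame(
  borough = factor(sample(
    c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
    n, replace = TRUE
  )),
  act_missing = rbinom(n, 1, 0.02) == 1
)

# Contingency table: missing flag (rows) by borough (columns)
tab <- table(toy$act_missing, toy$borough)

# Chi-squared test of independence with a Monte Carlo p-value
test <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
test$p.value
```

A large p-value would be compatible with missingness that does not depend on the borough; it would not prove that the values are missing completely at random.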

At this point we can omit the rows that contain NA values.

fire_data_new <- na.omit(fire_data)
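Note that `na.omit()` drops every row that has an NA in any column, including columns that are removed afterwards. The sketch below, on a hypothetical toy data frame (values made up for illustration), shows how the order of the two steps changes the number of rows that survive. Whether keeping those extra rows is desirable depends on the analysis, so this is an observation rather than a correction.

```r
# Toy data: zipcode is a column that will be discarded anyway (hypothetical values)
toy <- data.frame(
  zipcode          = c(NA, 10001, NA, 11201),
  engines_assigned = c(1, 2, NA, 1),
  inc_resp_min_qy  = c(5.2, 4.8, 6.1, 7.0)
)

# Order used in the report: omit NA rows first, drop columns later
rows_omit_first <- nrow(na.omit(toy))

# Alternative order: drop the unused column first, then omit NA rows
toy_reduced <- toy[, setdiff(names(toy), "zipcode")]
rows_drop_first <- nrow(na.omit(toy_reduced))

c(omit_first = rows_omit_first, drop_first = rows_drop_first)
```

In the toy example the first order keeps two rows and the second keeps three, because one row is missing only the column that is discarded anyway.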

And remove the predictors that are not useful:

fire_data_new <- fire_data_new %>% select(-c(zipcode, pol_prec, city_con_dist, commu_dist, al_location,
                                             commu_sc_dist, cong_dist, first_ass_datetime, first_act_datetime,
                                             first_onscene_datetime, inc_close_datetime, inc_resp_min_indc, 
                                             id, al_number, inc_class))
print(dfSummary(fire_data, 
                plain.ascii  = FALSE, 
                style        = "multiline", 
                headings     = FALSE,
                graph.magnif = 0.8, 
                valid.col    = FALSE),
                method = 'render')
No | Variable | Stats / Values | Freqs (% of Valid) | Missing
1 | id [character] | 1. 230905-B0053-001-0760; 2. 230905-B0053-002-0910; 3. 230905-B0106-002-0632; 4. 230905-B0132-001-0713; 5. 230905-B0160-001-1125; 6. 230905-B0165-001-0778; 7. 230905-B0226-001-1046; 8. 230905-B0232-001-1135; 9. 230905-B0236-001-0963; 10. 230905-B0238-001-0693; [ 32954 others ] | 1 (0.0%); 1 (0.0%); 1 (0.0%); 1 (0.0%); 1 (0.0%); 1 (0.0%); 1 (0.0%); 1 (0.0%); 1 (0.0%); 1 (0.0%); 32954 (100.0%) | 0 (0.0%)
2 | al_number [integer] | Mean (sd): 2963.8 (2476); min ≤ med ≤ max: 10 ≤ 2287 ≤ 9933; IQR (CV): 2861 (0.8) | 6836 distinct values | 0 (0.0%)
3 | al_location [character] | 1. 8 AVE & W 155 ST; 2. 10 RICHMAN PLZ/SEDGWICK A; 3. MONTGOMERY ST & BEDFORD A; 4. WASHINGTON AVE & E 170 ST; 5. 3 AVE & E 143 ST; 6. PARK AVE & E 158 ST; 7. AMSTERDAM AVE & LA SALLE; 8. CARLTON AVE & FULTON ST; 9. WEBSTER AVE 600 N OF E 16; 10. FDR DR & E 6 ST; [ 10786 others ] | 74 (0.2%); 54 (0.2%); 33 (0.1%); 33 (0.1%); 32 (0.1%); 32 (0.1%); 31 (0.1%); 29 (0.1%); 29 (0.1%); 28 (0.1%); 32589 (98.9%) | 0 (0.0%)
4 | inc_borough [factor] | 1. Bronx; 2. Brooklyn; 3. Manhattan; 4. Queens; 5. Staten Island | 6897 (20.9%); 9756 (29.6%); 7909 (24.0%); 6714 (20.4%); 1688 (5.1%) | 0 (0.0%)
5 | zipcode [integer] | Mean (sd): 10763.1 (548.7); min ≤ med ≤ max: 10000 ≤ 11101 ≤ 11697; IQR (CV): 925 (0.1) | 210 distinct values | 2197 (6.7%)
6 | pol_prec [integer] | Mean (sd): 64.2 (34.7); min ≤ med ≤ max: 1 ≤ 66 ≤ 123; IQR (CV): 60 (0.5) | 77 distinct values | 2197 (6.7%)
7 | city_con_dist [integer] | Mean (sd): 23.9 (15.1); min ≤ med ≤ max: 1 ≤ 23 ≤ 51; IQR (CV): 27 (0.6) | 51 distinct values | 2197 (6.7%)
8 | commu_dist [integer] | Mean (sd): 269.2 (119.2); min ≤ med ≤ max: 101 ≤ 303 ≤ 503; IQR (CV): 200 (0.4) | 68 distinct values | 2197 (6.7%)
9 | commu_sc_dist [integer] | Mean (sd): 15.3 (9.7); min ≤ med ≤ max: 1 ≤ 14 ≤ 32; IQR (CV): 17 (0.6) | 32 distinct values | 2198 (6.7%)
10 | cong_dist [integer] | Mean (sd): 10.3 (3.3); min ≤ med ≤ max: 3 ≤ 10 ≤ 16; IQR (CV): 5 (0.3) | 13 distinct values | 2197 (6.7%)
11 | al_source_desc [factor] | 1. PHONE; 2. EMS; 3. EMS-911; 4. CLASS-3; 5. Others | 14110 (42.8%); 7574 (23.0%); 4854 (14.7%); 4841 (14.7%); 1585 (4.8%) | 0 (0.0%)
12 | al_index_desc [factor] | 1. DEFAULT RECORD; 2. Initial Alarm; 3. Others | 3757 (11.4%); 29088 (88.2%); 119 (0.4%) | 0 (0.0%)
13 | highest_al_level [factor] | 1. All Hands Working; 2. First Alarm; 3. 2nd-3rd Alarm | 95 (0.3%); 32861 (99.7%); 8 (0.0%) | 0 (0.0%)
14 | inc_class [factor] | 1. Abandoned Derelict Vehicl; 2. Alarm System - Defective; 3. Alarm System - Testing; 4. Alarm System - Unnecessar; 5. Assist Civilian - Non-Med; 6. Automobile Fire; 7. Brush Fire; 8. Carbon Monoxide - Code 1; 9. Carbon Monoxide - Code 2; 10. Carbon Monoxide - Code 3; [ 57 others ] | 6 (0.0%); 377 (1.1%); 706 (2.1%); 2735 (8.3%); 3312 (10.0%); 101 (0.3%); 24 (0.1%); 788 (2.4%); 129 (0.4%); 88 (0.3%); 24698 (74.9%) | 0 (0.0%)
15 | inc_class_group [factor] | 1. Medical Emergencies; 2. Medical MFAs; 3. NonMedical Emergencies; 4. NonMedical MFAs; 5. NonStructural Fires; 6. Structural Fires | 11708 (35.5%); 164 (0.5%); 17483 (53.0%); 1518 (4.6%); 644 (2.0%); 1447 (4.4%) | 0 (0.0%)
16 | disp_resp_min_qy [numeric] | Mean (sd): 0.5 (0.5); min ≤ med ≤ max: 0 ≤ 0.5 ≤ 34.7; IQR (CV): 0.6 (1) | 198 distinct values | 0 (0.0%)
17 | first_ass_datetime [POSIXct, POSIXt] | min: 2023-09-05 14:19:12; med: 2023-09-18 11:25:15; max: 2023-09-30 23:59:42; range: 25d 9H 40M 30S | 32747 distinct values | 0 (0.0%)
18 | first_act_datetime [POSIXct, POSIXt] | min: 2023-09-05 14:19:26; med: 2023-09-18 11:25:27; max: 2023-09-30 23:59:53; range: 25d 9H 40M 27S | 32617 distinct values | 41 (0.1%)
19 | first_onscene_datetime [POSIXct, POSIXt] | min: 2023-09-05 14:23:21; med: 2023-09-18 11:29:46; max: 2023-10-01 00:04:00; range: 25d 9H 40M 39S | 32679 distinct values | 0 (0.0%)
20 | inc_close_datetime [POSIXct, POSIXt] | min: 2023-09-05 14:34:18; med: 2023-09-18 11:56:28; max: 2023-10-01 00:50:07; range: 25d 10H 15M 49S | 32681 distinct values | 0 (0.0%)
21 | inc_resp_min_indc [factor] | 1. N; 2. Y | 0 (0.0%); 32964 (100.0%) | 0 (0.0%)
22 | inc_resp_min_qy [numeric] | Mean (sd): 6 (2.8); min ≤ med ≤ max: 0.3 ≤ 5.5 ≤ 58.9; IQR (CV): 2.5 (0.5) | 1181 distinct values | 0 (0.0%)
23 | inc_travel_min_qy [numeric] | Mean (sd): 5.5 (2.7); min ≤ med ≤ max: 0 ≤ 5 ≤ 58.6; IQR (CV): 2.5 (0.5) | 1157 distinct values | 0 (0.0%)
24 | engines_assigned [integer] | Mean (sd): 1.2 (0.9); min ≤ med ≤ max: 0 ≤ 1 ≤ 19; IQR (CV): 0 (0.8) | 15 distinct values | 4 (0.0%)
25 | ladders_assigned [integer] | Mean (sd): 0.8 (0.8); min ≤ med ≤ max: 0 ≤ 1 ≤ 15; IQR (CV): 1 (1.1) | 12 distinct values | 4 (0.0%)
26 | others_units_assigned [integer] | Mean (sd): 0.4 (0.9); min ≤ med ≤ max: 0 ≤ 0 ≤ 32; IQR (CV): 1 (2.2) | 22 distinct values | 4 (0.0%)
27 | emergency_min_qy [numeric] | Mean (sd): 17.1 (20.3); min ≤ med ≤ max: 0 ≤ 11.9 ≤ 944.8; IQR (CV): 12.9 (1.2) | 4170 distinct values | 0 (0.0%)
28 | day_type [factor] | 1. Weekday; 2. Weekend | 24003 (72.8%); 8961 (27.2%) | 0 (0.0%)
29 | ticket_time [numeric] | Mean (sd): 23.1 (20.5); min ≤ med ≤ max: 0.8 ≤ 17.9 ≤ 947.7; IQR (CV): 13.6 (0.9) | 4389 distinct values | 0 (0.0%)
30 | time_of_day [factor] | 1. Night; 2. Morning; 3. Afternoon; 4. Evening | 5096 (15.5%); 8904 (27.0%); 11205 (34.0%); 7759 (23.5%) | 0 (0.0%)
31 | total_assigned_unit [integer] | Mean (sd): 2.4 (2.2); min ≤ med ≤ max: 1 ≤ 1 ≤ 66; IQR (CV): 2 (1) | 34 distinct values | 4 (0.0%)
32 | tua_is_one [factor] | 1. N; 2. Y | 13954 (42.3%); 19006 (57.7%) | 4 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-14

3.3 Additional Data Visualization

In this section we look at additional data visualizations to better understand how the predictors behave.

summary(fire_data_new)
##         inc_borough   al_source_desc         al_index_desc  
##  Bronx        :6406   PHONE  :13465   DEFAULT RECORD: 2979  
##  Brooklyn     :9280   EMS    : 7369   Initial Alarm :27631  
##  Manhattan    :7294   EMS-911: 4349   Others        :  118  
##  Queens       :6188   CLASS-3: 4817                         
##  Staten Island:1560   Others :  728                         
##                                                             
##           highest_al_level               inc_class_group  disp_resp_min_qy  
##  All Hands Working:   94   Medical Emergencies   :11222   Min.   : 0.03333  
##  First Alarm      :30626   Medical MFAs          :  146   1st Qu.: 0.13333  
##  2nd-3rd Alarm    :    8   NonMedical Emergencies:16471   Median : 0.48333  
##                            NonMedical MFAs       :  904   Mean   : 0.50062  
##                            NonStructural Fires   :  547   3rd Qu.: 0.71667  
##                            Structural Fires      : 1438   Max.   :34.73333  
##  inc_resp_min_qy  inc_travel_min_qy engines_assigned ladders_assigned 
##  Min.   : 0.350   Min.   : 0.000    Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.: 4.383   1st Qu.: 3.883    1st Qu.: 1.000   1st Qu.: 0.0000  
##  Median : 5.467   Median : 4.983    Median : 1.000   Median : 1.0000  
##  Mean   : 5.949   Mean   : 5.448    Mean   : 1.146   Mean   : 0.7795  
##  3rd Qu.: 6.850   3rd Qu.: 6.367    3rd Qu.: 1.000   3rd Qu.: 1.0000  
##  Max.   :58.917   Max.   :58.617    Max.   :19.000   Max.   :15.0000  
##  others_units_assigned emergency_min_qy    day_type      ticket_time     
##  Min.   : 0.0000       Min.   :  0.00   Weekday:22379   Min.   :  1.083  
##  1st Qu.: 0.0000       1st Qu.:  7.15   Weekend: 8349   1st Qu.: 12.850  
##  Median : 0.0000       Median : 11.93                   Median : 17.900  
##  Mean   : 0.3899       Mean   : 17.08                   Mean   : 23.027  
##  3rd Qu.: 1.0000       3rd Qu.: 19.58                   3rd Qu.: 25.954  
##  Max.   :32.0000       Max.   :596.38                   Max.   :601.717  
##     time_of_day    total_assigned_unit tua_is_one
##  Night    : 4487   Min.   : 1.000      N:12940   
##  Morning  : 8370   1st Qu.: 1.000      Y:17788   
##  Afternoon:10627   Median : 1.000                
##  Evening  : 7244   Mean   : 2.316                
##                    3rd Qu.: 3.000                
##                    Max.   :66.000

3.3.1 Y value: inc_resp_min_qy

ggplot(fire_data_new,
       aes(x = al_source_desc, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Alarm Source Description - Incident Minutes Response Time - Incident Class Group",
       x = "Alarm Source Description", y = "Incident Minutes Response Time", color = "Incident Class\nGroup")

Here we can see that only Medical Emergencies, Medical MFAs and NonMedical Emergencies arrive through the EMS-911 alarm source. Moreover, NonMedical Emergencies come mostly from PHONE, whereas Medical Emergencies come mostly from EMS.

ggplot(fire_data_new,
       aes(x = day_type, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Day Type - Incident Minutes Response Time - Incident Class Group",
       x = "Day Type", y = "Incident Minutes Response Time", color = "Incident Class\nGroup")

In this boxplot we can't see any clear pattern across day type and incident class group. However, there are many outliers for Medical and NonMedical Emergencies, since these are the two predominant incident class groups.

ggplot(fire_data_new,
       aes(x = time_of_day, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Day Range Time - Incident Minutes Response Time - Incident Class Group",
       x = "Day Range Time", y = "Incident Minutes Response Time", color = "Incident Class\nGroup")

As in the previous boxplot, we can't see any particularly relevant pattern here.

ggplot(fire_data_new, aes(x = total_assigned_unit, y = inc_resp_min_qy, group = inc_class_group)) +
  geom_point(aes(color = inc_class_group)) +
  labs(title = "Total Assigned Units - Incident Minutes Response Time - Incident Class Group",
       x = "Total Assigned Units", y = "Incident Minutes Response Time", color = "Incident Class\nGroup")

Here, as described in a previous similar chart, the total assigned units and the incident response time in minutes appear to be inversely related. The contrast is clearest between Structural Fires, which have many assigned units and low response times, and NonMedical Emergencies, which have few assigned units and high response times.

3.3.2 Y value: emergency_min_qy

Now we analyse the same plots as before, but with a different response on the Y axis: the emergency time in minutes.

ggplot(fire_data_new,
       aes(x = al_source_desc, y = emergency_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Alarm Source Description - Emergency Minutes Time - Incident Class Group",
       x = "Alarm Source Description", y = "Emergency Minutes Time", color = "Incident Class\nGroup")

Here the first thing we notice is the high number of Structural Fires outliers that lie above the upper whisker for the PHONE alarm source.

ggplot(fire_data_new,
       aes(x = day_type, y = emergency_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Day Type - Emergency Minutes Time - Incident Class Group",
       x = "Day Type", y = "Emergency Minutes Time", color = "Incident Class\nGroup")

Here instead we can't see any pattern regarding day type, whereas we can see that most outliers come from Structural Fires on both day types.

ggplot(fire_data_new,
       aes(x = time_of_day, y = emergency_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Day Range Time - Emergency Minutes Time - Incident Class Group",
       x = "Day Range Time", y = "Emergency Minutes Time", color = "Incident Class\nGroup")

Again, as in the previous boxplots, we can't see any relevant pattern, except for the heavy outliers belonging to Structural Fires.
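The outliers drawn by a boxplot can also be counted rather than read off the chart. The following is a minimal sketch on hypothetical toy values (the `toy` frame is made up for illustration). It relies on `boxplot.stats()`, which applies the same 1.5 × IQR whisker rule that `geom_boxplot()` uses by default, although the two may compute the quartiles slightly differently.

```r
# Toy data: one group with an extreme value, one without (hypothetical values)
toy <- data.frame(
  inc_class_group  = rep(c("Structural Fires", "Medical Emergencies"), each = 6),
  emergency_min_qy = c(1, 2, 3, 4, 5, 100,
                       1, 2, 3, 4, 5, 6)
)

# Number of points beyond the whiskers of a boxplot
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Outlier count for each incident class group
outliers_by_group <- tapply(toy$emergency_min_qy, toy$inc_class_group, count_outliers)
outliers_by_group
```

On the real data this would give one number per incident class group, which makes the comparison between groups less dependent on how the plot is scaled.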

ggplot(fire_data_new, aes(x=total_assigned_unit, y=emergency_min_qy, group=inc_class_group)) +
  geom_point(aes(color=inc_class_group)) +
  labs(title = "Total Assigned Units - Emergency Minutes Time - Incident Class Group",
       x = "Total Assigned Units", y = "Emergency Minutes Time", color = "Incident Class\nGroup")

Finally, here we can see that Structural Fires incidents show a lot of variance, and there seems to be a directly proportional relationship between the total assigned units and the emergency time. The NonMedical Emergencies are clustered, while the Medical Emergencies have mostly been assigned a single unit.

3.3.3 Maps Visualization

In this section we plot additional visualizations focused on the geography of the New York boroughs and the related predictors. To do so we load an additional dataset:

fdny-firehouse-listing.csv is a dataset that includes the geographical information of every firefighter station in NYC, including latitude and longitude.

firefighter_stations <- read.csv("datasets/fdny-firehouse-listing.csv")

head(firefighter_stations)
##                                              FacilityName       FacilityAddress
## 1                                      Engine 4/Ladder 15       42 South Street
## 2                                     Engine 10/Ladder 10    124 Liberty Street
## 3                                                Engine 6     49 Beekman Street
## 4 Engine 7/Ladder 1/Battalion 1/Manhattan Borough Command  100-104 Duane Street
## 5                                                Ladder 8 14 North Moore Street
## 6                                       Engine 9/Ladder 6       75 Canal Street
##     Borough Postcode Latitude Longitude Community.Board Community.Council
## 1 Manhattan    10005 40.70347 -74.00754               1                 1
## 2 Manhattan    10006 40.71007 -74.01252               1                 1
## 3 Manhattan    10038 40.71005 -74.00525               1                 1
## 4 Manhattan    10007 40.71546 -74.00594               1                 1
## 5 Manhattan    10013 40.71976 -74.00668               1                 1
## 6 Manhattan    10002 40.71521 -73.99290               3                 1
##   Census.Tract     BIN        BBL
## 1            7 1000867 1000350001
## 2           13 1075700 1000520022
## 3         1501 1001287 1000930030
## 4           33 1001647 1001500025
## 5           33 1002150 1001890035
## 6           16 1003898 1003000030
##                                                                           NTA
## 1 Battery Park City-Lower Manhattan                                          
## 2 Battery Park City-Lower Manhattan                                          
## 3 Battery Park City-Lower Manhattan                                          
## 4 SoHo-TriBeCa-Civic Center-Little Italy                                     
## 5 SoHo-TriBeCa-Civic Center-Little Italy                                     
## 6 Chinatown
summary(firefighter_stations)
##  FacilityName       FacilityAddress      Borough             Postcode    
##  Length:218         Length:218         Length:218         Min.   :10001  
##  Class :character   Class :character   Class :character   1st Qu.:10304  
##  Mode  :character   Mode  :character   Mode  :character   Median :11103  
##                                                           Mean   :10784  
##                                                           3rd Qu.:11231  
##                                                           Max.   :11695  
##                                                           NA's   :5      
##     Latitude       Longitude      Community.Board  Community.Council
##  Min.   :40.51   Min.   :-74.24   Min.   : 1.000   Min.   : 1.00    
##  1st Qu.:40.66   1st Qu.:-73.99   1st Qu.: 3.000   1st Qu.:12.00    
##  Median :40.72   Median :-73.94   Median : 6.000   Median :27.00    
##  Mean   :40.72   Mean   :-73.94   Mean   : 7.075   Mean   :25.63    
##  3rd Qu.:40.77   3rd Qu.:-73.89   3rd Qu.:11.000   3rd Qu.:38.00    
##  Max.   :40.89   Max.   :-73.72   Max.   :84.000   Max.   :51.00    
##  NA's   :5       NA's   :5        NA's   :5        NA's   :5        
##   Census.Tract         BIN               BBL                NTA           
##  Min.   :     1   Min.   :1000867   Min.   :1.000e+09   Length:218        
##  1st Qu.:   129   1st Qu.:2003268   1st Qu.:2.025e+09   Class :character  
##  Median :   275   Median :3064786   Median :3.025e+09   Mode  :character  
##  Mean   :  5950   Mean   :2900421   Mean   :2.850e+09                     
##  3rd Qu.:   800   3rd Qu.:4090228   3rd Qu.:4.033e+09                     
##  Max.   :157902   Max.   :5154879   Max.   :5.080e+09                     
##  NA's   :5        NA's   :5         NA's   :5

We start with the firefighter stations dataset. First we make a copy of fire_data_new and convert the borough column of the firefighter_stations dataset to a factor, so that it can be easily merged with the copied dataset.

# make a copy of the fire_data
fire_data_for_ffs <- fire_data_new

fire_data_for_ffs <- fire_data_for_ffs %>% rename(borough = inc_borough)

firefighter_stations$Borough <- as.factor(firefighter_stations$Borough)
firefighter_stations <- firefighter_stations %>% rename(borough = Borough)

# remove the na values from firefighter_stations
firefighter_stations <- na.omit(firefighter_stations)

Now we want to get the number of firefighter stations for each borough.

stations_borough <- firefighter_stations %>%
                    group_by(borough) %>%
                    summarise(number_of_stations = n())

Now we want a summary of the incident count, the number of stations and the incidents per station for each borough, in order to have a general view of the New York City situation.

count_inc_brough <- fire_data_for_ffs %>% group_by(borough) %>% summarise(incident_count = n())

stations_borough$incident_per_station <- round(count_inc_brough$incident_count / stations_borough$number_of_stations, digits = 3)

count_inc_brough <- merge(count_inc_brough, stations_borough, by="borough")

count_inc_brough
##         borough incident_count number_of_stations incident_per_station
## 1         Bronx           6406                 34              188.412
## 2      Brooklyn           9280                 64              145.000
## 3     Manhattan           7294                 47              155.191
## 4        Queens           6188                 48              128.917
## 5 Staten Island           1560                 20               78.000
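The ratio above is computed by dividing two columns that live in different data frames, which is correct only as long as both are sorted by borough in the same order. A minimal sketch of a join-first variant follows; the two small data frames reuse three rows of the table above, deliberately shuffled, and the names `incidents`, `stations` and `per_borough` are hypothetical.

```r
# Per-borough incident counts, deliberately not in alphabetical order
incidents <- data.frame(
  borough        = c("Queens", "Bronx", "Brooklyn"),
  incident_count = c(6188, 6406, 9280)
)

# Per-borough station counts, in a different order
stations <- data.frame(
  borough            = c("Bronx", "Brooklyn", "Queens"),
  number_of_stations = c(34, 64, 48)
)

# Join on the key first, so each ratio uses counts from the same borough
per_borough <- merge(incidents, stations, by = "borough")
per_borough$incident_per_station <- round(
  per_borough$incident_count / per_borough$number_of_stations,
  digits = 3
)
per_borough
```

Joining before dividing gives the same figures as the table above for these boroughs, without relying on row order.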

Now we convert the firefighter_stations data frame into an sf (simple features) object, which stores the point geometries.

firefighter_stations_sdf <- st_as_sf(firefighter_stations, coords = c("Longitude", "Latitude"), crs = 4326)
head(firefighter_stations_sdf)
## Simple feature collection with 6 features and 10 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -74.01252 ymin: 40.70347 xmax: -73.9929 ymax: 40.71976
## Geodetic CRS:  WGS 84
##                                              FacilityName       FacilityAddress
## 1                                      Engine 4/Ladder 15       42 South Street
## 2                                     Engine 10/Ladder 10    124 Liberty Street
## 3                                                Engine 6     49 Beekman Street
## 4 Engine 7/Ladder 1/Battalion 1/Manhattan Borough Command  100-104 Duane Street
## 5                                                Ladder 8 14 North Moore Street
## 6                                       Engine 9/Ladder 6       75 Canal Street
##     borough Postcode Community.Board Community.Council Census.Tract     BIN
## 1 Manhattan    10005               1                 1            7 1000867
## 2 Manhattan    10006               1                 1           13 1075700
## 3 Manhattan    10038               1                 1         1501 1001287
## 4 Manhattan    10007               1                 1           33 1001647
## 5 Manhattan    10013               1                 1           33 1002150
## 6 Manhattan    10002               3                 1           16 1003898
##          BBL
## 1 1000350001
## 2 1000520022
## 3 1000930030
## 4 1001500025
## 5 1001890035
## 6 1003000030
##                                                                           NTA
## 1 Battery Park City-Lower Manhattan                                          
## 2 Battery Park City-Lower Manhattan                                          
## 3 Battery Park City-Lower Manhattan                                          
## 4 SoHo-TriBeCa-Civic Center-Little Italy                                     
## 5 SoHo-TriBeCa-Civic Center-Little Italy                                     
## 6 Chinatown                                                                  
##                     geometry
## 1 POINT (-74.00754 40.70347)
## 2 POINT (-74.01252 40.71007)
## 3 POINT (-74.00525 40.71005)
## 4 POINT (-74.00594 40.71546)
## 5 POINT (-74.00668 40.71976)
## 6  POINT (-73.9929 40.71521)

3.3.3.1 Download of the geojson file

At this point we load the .geojson file that contains the geometry of each borough, in order to get a nice map visualization of NYC.

geojson_newyork <- geojson_read("datasets/NYC_BoroughBoundaries.geojson",  what = "sp")
geojson_newyork <- setNames(geojson_newyork, c("borough_code", "borough", "shape_area", "shape_leng"))
geojson_newyork$borough <- as.factor(geojson_newyork$borough)
geojson_newyork$borough_code <- NULL
head(geojson_newyork)
##         borough    shape_area    shape_leng
## 1 Staten Island 1623620725.05  325917.35395
## 2     Manhattan 636520502.758 357713.308162
## 3         Bronx  1187174772.5 463180.579449
## 4      Brooklyn 1934138215.76 728146.574928
## 5        Queens 3041418506.64 888199.731385

And now we merge geojson_newyork with count_inc_brough while keeping the spatial data frame type.

geojson_newyork@data = data.frame(geojson_newyork@data, count_inc_brough[match(geojson_newyork@data$borough, count_inc_brough$borough),])
geojson_newyork@data$borough.1 <- NULL

And finally we can plot the interactive map using the mapview function.

mapview(list(firefighter_stations_sdf, geojson_newyork),
        zcol = list(NULL, "incident_count"),
        legend = list(FALSE, TRUE),
        homebutton = list(FALSE, TRUE), layer.name = list(NULL, "incidents_number"), alpha.regions = 0.5, alpha = 1)

4 Let’s build some models (or at least try)

As suggested by the professor, we opted to solve a regression problem, first with inc_resp_min_qy as the response and then with emergency_min_qy. Initially we planned to solve a multi-class / binary classification problem for inc_class_group. However, that would have meant using all the time-difference predictors, which are future information with respect to inc_class_group at prediction time, so it doesn't make much sense to use them, and they could also turn out to be overly strong predictors. That's why we decided to follow the professor's suggestion.

For both analyses we transform the response to log scale in order to mimic the behaviour of the Exponential and Gamma GLMs.
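To illustrate why a log-transformed response resembles a Gamma GLM with a log link, here is a small sketch on simulated data. Everything in it (the seed, the sample size, the true coefficients 1 and 0.5, the shape parameter) is an assumption chosen for the example and has nothing to do with the fire data.

```r
set.seed(123)
n <- 2000

# Simulated predictor and a Gamma response whose mean is exp(1 + 0.5 * x)
x     <- runif(n, 0, 2)
mu    <- exp(1 + 0.5 * x)
shape <- 5
y     <- rgamma(n, shape = shape, rate = shape / mu)

# Linear model on the log of the response
fit_log_lm <- lm(log(y) ~ x)

# Gamma GLM with log link on the original response
fit_gamma <- glm(y ~ x, family = Gamma(link = "log"))

# Compare the estimated coefficients of the two fits
rbind(log_lm = coef(fit_log_lm), gamma_glm = coef(fit_gamma))
```

Both fits should recover a slope close to 0.5. The intercepts differ because the linear model estimates the mean of log(y), which is lower than the log of the mean of y, so predictions from the log-scale model need care when they are transformed back.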

So first things first let’s check if there are some observations that have at least one of the time differences equal to zero.

summary(fire_data_new %>% select(disp_resp_min_qy, inc_travel_min_qy, inc_resp_min_qy, emergency_min_qy, ticket_time))
##  disp_resp_min_qy   inc_travel_min_qy inc_resp_min_qy  emergency_min_qy
##  Min.   : 0.03333   Min.   : 0.000    Min.   : 0.350   Min.   :  0.00  
##  1st Qu.: 0.13333   1st Qu.: 3.883    1st Qu.: 4.383   1st Qu.:  7.15  
##  Median : 0.48333   Median : 4.983    Median : 5.467   Median : 11.93  
##  Mean   : 0.50062   Mean   : 5.448    Mean   : 5.949   Mean   : 17.08  
##  3rd Qu.: 0.71667   3rd Qu.: 6.367    3rd Qu.: 6.850   3rd Qu.: 19.58  
##  Max.   :34.73333   Max.   :58.617    Max.   :58.917   Max.   :596.38  
##   ticket_time     
##  Min.   :  1.083  
##  1st Qu.: 12.850  
##  Median : 17.900  
##  Mean   : 23.027  
##  3rd Qu.: 25.954  
##  Max.   :601.717
fire_data_new <- fire_data_new %>% filter(inc_travel_min_qy != 0, emergency_min_qy != 0)
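To see why the zero durations matter, here is a minimal base R sketch on made-up values. The toy data frame is an assumption; only the column names are borrowed from our dataset.

```r
# Hypothetical durations in minutes; two rows contain a zero.
durations <- data.frame(
  inc_travel_min_qy = c(0, 3.9, 5.0),
  emergency_min_qy  = c(7.2, 0, 11.9)
)

# log(0) is -Inf in R, which a linear model cannot use as a response value.
print(log(durations$inc_travel_min_qy))

# Keep only the rows where both durations are non-zero.
keep <- durations$inc_travel_min_qy != 0 & durations$emergency_min_qy != 0
durations_pos <- durations[keep, , drop = FALSE]

# After filtering, every log-duration is finite.
print(all(is.finite(log(as.matrix(durations_pos)))))
```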

Then we have to check for correlation among the continuous predictors and, if we find any, delete one or more of them. The table below reports the squared correlations.

round(cor(fire_data_new %>% dplyr::select(where(is.numeric)))^2, digits=3)
##                       disp_resp_min_qy inc_resp_min_qy inc_travel_min_qy
## disp_resp_min_qy                 1.000           0.018             0.003
## inc_resp_min_qy                  0.018           1.000             0.967
## inc_travel_min_qy                0.003           0.967             1.000
## engines_assigned                 0.011           0.064             0.075
## ladders_assigned                 0.163           0.015             0.039
## others_units_assigned            0.025           0.023             0.033
## emergency_min_qy                 0.002           0.000             0.000
## ticket_time                      0.001           0.016             0.018
## total_assigned_unit              0.065           0.045             0.068
##                       engines_assigned ladders_assigned others_units_assigned
## disp_resp_min_qy                 0.011            0.163                 0.025
## inc_resp_min_qy                  0.064            0.015                 0.023
## inc_travel_min_qy                0.075            0.039                 0.033
## engines_assigned                 1.000            0.317                 0.295
## ladders_assigned                 0.317            1.000                 0.325
## others_units_assigned            0.295            0.325                 1.000
## emergency_min_qy                 0.046            0.017                 0.121
## ticket_time                      0.031            0.012                 0.105
## total_assigned_unit              0.719            0.693                 0.704
##                       emergency_min_qy ticket_time total_assigned_unit
## disp_resp_min_qy                 0.002       0.001               0.065
## inc_resp_min_qy                  0.000       0.016               0.045
## inc_travel_min_qy                0.000       0.018               0.068
## engines_assigned                 0.046       0.031               0.719
## ladders_assigned                 0.017       0.012               0.693
## others_units_assigned            0.121       0.105               0.704
## emergency_min_qy                 1.000       0.981               0.077
## ticket_time                      0.981       1.000               0.060
## total_assigned_unit              0.077       0.060               1.000

As we can see, total_assigned_unit is heavily correlated with the other counts since it is their sum; that’s why we decided to remove it from the dataframe. We also note that many of the time differences are correlated with each other. This is expected, since some of them include other, smaller differences; these measures will be handled shortly, when we deal with the two types of analysis. We did this step to avoid singularities in the future models.

fire_data_new <- fire_data_new %>% select(-c(total_assigned_unit))
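The singularity caused by a predictor that is the exact sum of others can be reproduced with simulated counts. Everything in this sketch (variable names, coefficients, sample size) is hypothetical. It only illustrates that lm() cannot estimate a coefficient for the redundant column, even when the pairwise correlations are well below 1.

```r
set.seed(1)
n <- 200
engines <- rpois(n, 2)
ladders <- rpois(n, 1)
others  <- rpois(n, 1)
total   <- engines + ladders + others   # exact linear combination
y <- 5 - 0.4 * engines + 0.1 * ladders + rnorm(n)

fit <- lm(y ~ engines + ladders + others + total)

# The aliased column gets an NA coefficient.
print(coef(fit))

# alias() reports which term is a linear combination of the others.
print(alias(fit)$Complete)
```

Dropping the sum, as we do above, leaves a full-rank design matrix.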

Next, before creating any model, we have to split the cleaned dataset into train and test, with 80% of the whole dataset for the train set and the remaining 20% for the test set.

set.seed(43)
split <- sample.split(fire_data_new, SplitRatio = 0.8)

# Create training and testing sets
fire_data.train <- subset(fire_data_new, split == TRUE)
fire_data.test <- subset(fire_data_new, split == FALSE)

rownames(fire_data.train) <- NULL
rownames(fire_data.test) <- NULL

dim(fire_data.train)
## [1] 23043    16
dim(fire_data.test)
## [1] 7682   16
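Note that the dimensions printed above give 23043 / (23043 + 7682), roughly 75% of the rows in the train set rather than 80%. A plausible explanation, which we state as an assumption, is that sample.split received the whole data frame: its length is the number of columns (16), so the resulting flag vector is recycled over the rows. Passing a single column, e.g. sample.split(fire_data_new$inc_resp_min_qy, SplitRatio = 0.8), yields one flag per row. The base R sketch below shows the same idea on toy data.

```r
set.seed(43)
toy <- data.frame(x = rnorm(1000), y = rnorm(1000))  # hypothetical data

# One TRUE/FALSE flag per row, with 80% of the rows flagged for training.
n_train  <- round(0.8 * nrow(toy))
in_train <- seq_len(nrow(toy)) %in% sample(nrow(toy), n_train)

toy.train <- toy[in_train, ]
toy.test  <- toy[!in_train, ]

print(c(train = nrow(toy.train), test = nrow(toy.test)))
```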

4.1 Linear Regression???

4.1.1 Use inc_resp_min_qy as response

In this section we use inc_resp_min_qy as the response, so we have to remove all the time-difference predictors that are computed from one of the two datetimes that come after the incident datetime, i.e. all of them except our actual response.

# make a copy of the train and test
resp_min_fd.train <- fire_data.train
resp_min_fd.test <- fire_data.test

# remove the future time differences
resp_min_fd.train <- resp_min_fd.train %>% select(-c(disp_resp_min_qy, inc_travel_min_qy, emergency_min_qy, ticket_time))
resp_min_fd.test <- resp_min_fd.test %>% select(-c(disp_resp_min_qy, inc_travel_min_qy, emergency_min_qy, ticket_time))

Let’s build our first Linear Regression Model

lm_irm_full <- lm(inc_resp_min_qy ~ ., data = resp_min_fd.train)
summary(lm_irm_full)
## 
## Call:
## lm(formula = inc_resp_min_qy ~ ., data = resp_min_fd.train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.014 -1.359 -0.384  0.814 52.604 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            5.80348    0.91357   6.353 2.16e-10 ***
## inc_boroughBrooklyn                   -1.01049    0.04763 -21.216  < 2e-16 ***
## inc_boroughManhattan                  -0.14836    0.05048  -2.939 0.003296 ** 
## inc_boroughQueens                     -0.31298    0.05263  -5.947 2.77e-09 ***
## inc_boroughStaten Island              -0.73366    0.08366  -8.769  < 2e-16 ***
## al_source_descEMS                     -0.21151    0.11172  -1.893 0.058356 .  
## al_source_descEMS-911                 -0.27526    0.11282  -2.440 0.014701 *  
## al_source_descCLASS-3                  0.08852    0.05590   1.584 0.113303    
## al_source_descOthers                  -1.09849    0.11663  -9.419  < 2e-16 ***
## al_index_descInitial Alarm            -0.21548    0.07336  -2.937 0.003315 ** 
## al_index_descOthers                    1.46382    0.88917   1.646 0.099720 .  
## highest_al_levelFirst Alarm            0.84831    0.90124   0.941 0.346576    
## highest_al_level2nd-3rd Alarm          3.84181    1.23754   3.104 0.001909 ** 
## inc_class_groupMedical MFAs           -0.01443    0.24796  -0.058 0.953585    
## inc_class_groupNonMedical Emergencies  0.45755    0.10936   4.184 2.88e-05 ***
## inc_class_groupNonMedical MFAs         0.34701    0.15923   2.179 0.029316 *  
## inc_class_groupNonStructural Fires    -0.21984    0.16588  -1.325 0.185091    
## inc_class_groupStructural Fires       -0.03962    0.13924  -0.285 0.775974    
## engines_assigned                      -0.46587    0.02833 -16.446  < 2e-16 ***
## ladders_assigned                       0.14897    0.04505   3.306 0.000947 ***
## others_units_assigned                 -0.10067    0.03025  -3.328 0.000877 ***
## day_typeWeekend                       -0.19471    0.03763  -5.175 2.30e-07 ***
## time_of_dayMorning                    -0.18403    0.05436  -3.385 0.000712 ***
## time_of_dayAfternoon                  -0.37902    0.05232  -7.244 4.48e-13 ***
## time_of_dayEvening                    -0.65517    0.05577 -11.749  < 2e-16 ***
## tua_is_oneY                            1.08088    0.06341  17.047  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.536 on 23017 degrees of freedom
## Multiple R-squared:  0.1209, Adjusted R-squared:   0.12 
## F-statistic: 126.6 on 25 and 23017 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_irm_full)

Now we have to check whether the linear model assumptions are met, and thus whether we can use a linear regression model for our analysis.

  1. Residuals vs Fitted plot: shows whether the relationship is linear, i.e. whether the residuals are free of any systematic pattern. This is confirmed by the straight horizontal red line, even though there are more widely spread observations above the red line.
  2. Q-Q Residuals plot (quantile-quantile residual plot): tells us whether the residuals are normally distributed. If they follow the 45-degree dotted line we can say so; in our case they don’t, so we can’t say they are normally distributed, as we will see in more detail later.
  3. Scale-Location / Spread-Location plot: tells us whether the residuals are equally spread along the range of fitted values. This is the assessment of homoscedasticity, or equal variance. We would like to see a roughly horizontal line, which is more or less the case here.
  4. Residuals vs Leverage plot: helps us identify influential points through Cook’s distance, i.e. points that influence the regression line. Points that fall beyond the dotted Cook’s distance contours are considered influential. In our case no observation does.

qqPlot(residuals(lm_irm_full))

## [1] 11795 20875

The qqPlot shows much more clearly that the residuals are not normally distributed; indeed they are heavily right-skewed. Thus we can’t fully trust the p-values and the inference on the coefficients.

residualPlots(lm_irm_full)

##                       Test stat Pr(>|Test stat|)    
## inc_borough                                         
## al_source_desc                                      
## al_index_desc                                       
## highest_al_level                                    
## inc_class_group                                     
## engines_assigned         6.5604        5.480e-11 ***
## ladders_assigned         5.5546        2.813e-08 ***
## others_units_assigned    4.5496        5.403e-06 ***
## day_type                                            
## time_of_day                                         
## tua_is_one                                          
## Tukey test               6.3052        2.878e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here instead we look at all the plots of residuals vs predictors and, again, the plot of residuals vs fitted values that we have already seen.

Let’s have a look at a possible power transformation of the response.

powerTransform(lm_irm_full)
## Estimated transformation parameter 
##        Y1 
## 0.1012643

The function powerTransform suggests taking the log-transformation of the response: the estimated value of lambda is close to zero, which corresponds to the log transformation.
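To make the link between a lambda close to zero and the logarithm concrete, here is a small sketch of the Box-Cox power family, which is the default family of powerTransform. The helper box_cox and the sample values are our own illustration, not part of the analysis.

```r
# Box-Cox transform: (y^lambda - 1) / lambda, with log(y) as the limit at 0.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

y <- c(0.35, 4.4, 5.5, 6.9, 58.9)   # hypothetical positive durations

# For a tiny lambda the transform is numerically indistinguishable from log(y).
print(max(abs(box_cox(y, 1e-6) - log(y))))

# With the estimated lambda the two transforms are compared side by side.
print(cbind(log_y = log(y), box_cox_est = box_cox(y, 0.1012643)))
```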

lm_irm_full_upd <- update(lm_irm_full, log(inc_resp_min_qy) ~ .)
summary(lm_irm_full_upd)
## 
## Call:
## lm(formula = log(inc_resp_min_qy) ~ inc_borough + al_source_desc + 
##     al_index_desc + highest_al_level + inc_class_group + engines_assigned + 
##     ladders_assigned + others_units_assigned + day_type + time_of_day + 
##     tua_is_one, data = resp_min_fd.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47374 -0.19977 -0.00464  0.19592  2.69052 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            1.7180302  0.1334082  12.878  < 2e-16
## inc_boroughBrooklyn                   -0.1770990  0.0069553 -25.463  < 2e-16
## inc_boroughManhattan                  -0.0373227  0.0073715  -5.063 4.16e-07
## inc_boroughQueens                     -0.0465490  0.0076855  -6.057 1.41e-09
## inc_boroughStaten Island              -0.1199289  0.0122171  -9.816  < 2e-16
## al_source_descEMS                     -0.0203318  0.0163151  -1.246 0.212705
## al_source_descEMS-911                 -0.0339945  0.0164747  -2.063 0.039083
## al_source_descCLASS-3                  0.0399158  0.0081626   4.890 1.01e-06
## al_source_descOthers                  -0.4691826  0.0170311 -27.549  < 2e-16
## al_index_descInitial Alarm            -0.0287322  0.0107130  -2.682 0.007324
## al_index_descOthers                    0.1761202  0.1298452   1.356 0.174990
## highest_al_levelFirst Alarm            0.1361029  0.1316083   1.034 0.301076
## highest_al_level2nd-3rd Alarm          0.6022132  0.1807176   3.332 0.000863
## inc_class_groupMedical MFAs            0.0003317  0.0362090   0.009 0.992691
## inc_class_groupNonMedical Emergencies  0.0788922  0.0159704   4.940 7.87e-07
## inc_class_groupNonMedical MFAs         0.0574061  0.0232520   2.469 0.013562
## inc_class_groupNonStructural Fires    -0.0250603  0.0242238  -1.035 0.300898
## inc_class_groupStructural Fires       -0.0223811  0.0203328  -1.101 0.271024
## engines_assigned                      -0.0749646  0.0041366 -18.122  < 2e-16
## ladders_assigned                       0.0104863  0.0065793   1.594 0.110990
## others_units_assigned                 -0.0075662  0.0044175  -1.713 0.086771
## day_typeWeekend                       -0.0251404  0.0054948  -4.575 4.78e-06
## time_of_dayMorning                    -0.0653316  0.0079380  -8.230  < 2e-16
## time_of_dayAfternoon                  -0.0905598  0.0076402 -11.853  < 2e-16
## time_of_dayEvening                    -0.1252760  0.0081434 -15.384  < 2e-16
## tua_is_oneY                            0.1538454  0.0092591  16.616  < 2e-16
##                                          
## (Intercept)                           ***
## inc_boroughBrooklyn                   ***
## inc_boroughManhattan                  ***
## inc_boroughQueens                     ***
## inc_boroughStaten Island              ***
## al_source_descEMS                        
## al_source_descEMS-911                 *  
## al_source_descCLASS-3                 ***
## al_source_descOthers                  ***
## al_index_descInitial Alarm            ** 
## al_index_descOthers                      
## highest_al_levelFirst Alarm              
## highest_al_level2nd-3rd Alarm         ***
## inc_class_groupMedical MFAs              
## inc_class_groupNonMedical Emergencies ***
## inc_class_groupNonMedical MFAs        *  
## inc_class_groupNonStructural Fires       
## inc_class_groupStructural Fires          
## engines_assigned                      ***
## ladders_assigned                         
## others_units_assigned                 .  
## day_typeWeekend                       ***
## time_of_dayMorning                    ***
## time_of_dayAfternoon                  ***
## time_of_dayEvening                    ***
## tua_is_oneY                           ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3703 on 23017 degrees of freedom
## Multiple R-squared:  0.163,  Adjusted R-squared:  0.162 
## F-statistic: 179.2 on 25 and 23017 DF,  p-value: < 2.2e-16

Now remember that we can’t compare the \(R^2\) with that of the previous model, since here the response is on a different scale.

Let’s see how the residuals behave in this new model.
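One possible way to put two such models on a common footing, sketched here on simulated data, is to back-transform the fitted values of the log model and measure the error in the original units. All names and values below are hypothetical. The naive exp() back-transform targets the conditional median rather than the mean under log-normal errors, so this is only a rough comparison.

```r
set.seed(2)
n <- 500
x <- runif(n, 0, 3)
y <- exp(1 + 0.4 * x + rnorm(n, sd = 0.4))  # positive, right-skewed response

fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ x)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Both errors are expressed in the units of y, so they are comparable.
rmse_raw <- rmse(y, fitted(fit_raw))
rmse_log <- rmse(y, exp(fitted(fit_log)))   # naive back-transform

print(c(raw = rmse_raw, log_backtransformed = rmse_log))
```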

par(mfrow=c(2,2))
plot(lm_irm_full_upd)

In this model the residuals follow a linear pattern much better than in the previous one. Again we do not have any influential point. On the other hand, the Q-Q plot is pretty much a mess, indicating that the residuals are not normally distributed. From the Q-Q plot we can say that:

  1. The smallest observations are larger than you would expect from a normal distribution (i.e. the points are above the line on the QQ-plot). This means the lower tail of the data’s distribution has been reduced, relative to a normal distribution.
  2. The largest observations are less than you would expect from a normal distribution (i.e. the points are below the line on the QQ-plot). This means the upper tail of the data’s distribution has been reduced, relative to a normal distribution.
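A reading of the tails like the one above can be cross-checked numerically by comparing empirical quantiles of the residuals with the quantiles of a normal distribution that has the same mean and standard deviation. The sketch below uses simulated heavy-tailed values as a stand-in for residuals(lm_irm_full_upd). If the extreme empirical quantiles lie further from the centre than the normal ones the tails are heavier than normal, and if they lie closer the tails are lighter.

```r
set.seed(3)
res <- rt(5000, df = 3)   # hypothetical stand-in for model residuals

probs <- c(0.001, 0.01, 0.25, 0.75, 0.99, 0.999)
empirical <- quantile(res, probs)

# Normal quantiles with the same centre and spread as the sample.
theoretical <- qnorm(probs, mean = mean(res), sd = sd(res))

print(round(cbind(empirical, theoretical), 3))
```

For our model, the summary above reports residual quartiles of about ±0.20 and extremes of about -2.47 and 2.69 against a residual standard error of 0.3703, which is the kind of information this comparison makes explicit.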

The qqPlot below is much clearer; we also show the residuals vs predictors and residuals vs fitted values plots.

residualPlots(lm_irm_full_upd)

##                       Test stat Pr(>|Test stat|)    
## inc_borough                                         
## al_source_desc                                      
## al_index_desc                                       
## highest_al_level                                    
## inc_class_group                                     
## engines_assigned         5.5255        3.321e-08 ***
## ladders_assigned         5.1370        2.815e-07 ***
## others_units_assigned    4.4142        1.018e-05 ***
## day_type                                            
## time_of_day                                         
## tua_is_one                                          
## Tukey test               6.1078        1.010e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqPlot(residuals(lm_irm_full_upd))

## [1]  8399 11439

Thus again the linear model assumptions are not met, mainly because of the non-normal distribution of the residuals.

At this point we decided to investigate this behaviour anyway, trying to fix the non-normality of the residuals by changing the scale of some predictors and adding interaction terms between them.

We start by modifying the previous model, adding the interaction terms and taking the logarithm of the number of assigned units after increasing it by one unit.

lm_irm_full_upd_2 <- update(lm_irm_full_upd,
  log(inc_resp_min_qy) ~ . +
    engines_assigned : inc_class_group +
    time_of_day : day_type +
    log(ladders_assigned + 1) +
    log(engines_assigned + 1) +
    log(others_units_assigned + 1) +
    log(engines_assigned + 1) : inc_class_group)
summary(lm_irm_full_upd_2)
## 
## Call:
## lm(formula = log(inc_resp_min_qy) ~ inc_borough + al_source_desc + 
##     al_index_desc + highest_al_level + inc_class_group + engines_assigned + 
##     ladders_assigned + others_units_assigned + day_type + time_of_day + 
##     tua_is_one + log(ladders_assigned + 1) + log(engines_assigned + 
##     1) + log(others_units_assigned + 1) + inc_class_group:engines_assigned + 
##     day_type:time_of_day + inc_class_group:log(engines_assigned + 
##     1), data = resp_min_fd.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.67577 -0.19744 -0.00207  0.19826  2.55364 
## 
## Coefficients:
##                                                                  Estimate
## (Intercept)                                                      1.762201
## inc_boroughBrooklyn                                             -0.179870
## inc_boroughManhattan                                            -0.041988
## inc_boroughQueens                                               -0.045810
## inc_boroughStaten Island                                        -0.122447
## al_source_descEMS                                               -0.026849
## al_source_descEMS-911                                           -0.041289
## al_source_descCLASS-3                                            0.032810
## al_source_descOthers                                            -0.499088
## al_index_descInitial Alarm                                      -0.030575
## al_index_descOthers                                             -0.005135
## highest_al_levelFirst Alarm                                      0.120641
## highest_al_level2nd-3rd Alarm                                   -0.924549
## inc_class_groupMedical MFAs                                     -0.021754
## inc_class_groupNonMedical Emergencies                            0.426759
## inc_class_groupNonMedical MFAs                                   0.574809
## inc_class_groupNonStructural Fires                              -0.064680
## inc_class_groupStructural Fires                                  0.383284
## engines_assigned                                                -0.195436
## ladders_assigned                                                 0.291426
## others_units_assigned                                            0.003291
## day_typeWeekend                                                  0.002372
## time_of_dayMorning                                              -0.054184
## time_of_dayAfternoon                                            -0.081539
## time_of_dayEvening                                              -0.130137
## tua_is_oneY                                                     -0.034157
## log(ladders_assigned + 1)                                       -0.679472
## log(engines_assigned + 1)                                        0.419757
## log(others_units_assigned + 1)                                  -0.049386
## inc_class_groupMedical MFAs:engines_assigned                     0.123800
## inc_class_groupNonMedical Emergencies:engines_assigned           0.340628
## inc_class_groupNonMedical MFAs:engines_assigned                  0.523430
## inc_class_groupNonStructural Fires:engines_assigned              0.161493
## inc_class_groupStructural Fires:engines_assigned                 0.202559
## day_typeWeekend:time_of_dayMorning                              -0.053103
## day_typeWeekend:time_of_dayAfternoon                            -0.047725
## day_typeWeekend:time_of_dayEvening                               0.013392
## inc_class_groupMedical MFAs:log(engines_assigned + 1)           -0.158047
## inc_class_groupNonMedical Emergencies:log(engines_assigned + 1) -1.012682
## inc_class_groupNonMedical MFAs:log(engines_assigned + 1)        -1.591994
## inc_class_groupNonStructural Fires:log(engines_assigned + 1)    -0.286924
## inc_class_groupStructural Fires:log(engines_assigned + 1)       -0.796810
##                                                                 Std. Error
## (Intercept)                                                       0.136218
## inc_boroughBrooklyn                                               0.006888
## inc_boroughManhattan                                              0.007297
## inc_boroughQueens                                                 0.007616
## inc_boroughStaten Island                                          0.012097
## al_source_descEMS                                                 0.016447
## al_source_descEMS-911                                             0.016576
## al_source_descCLASS-3                                             0.008296
## al_source_descOthers                                              0.016993
## al_index_descInitial Alarm                                        0.010597
## al_index_descOthers                                               0.137273
## highest_al_levelFirst Alarm                                       0.132032
## highest_al_level2nd-3rd Alarm                                     0.272628
## inc_class_groupMedical MFAs                                       0.164270
## inc_class_groupNonMedical Emergencies                             0.028837
## inc_class_groupNonMedical MFAs                                    0.041492
## inc_class_groupNonStructural Fires                                0.094043
## inc_class_groupStructural Fires                                   0.061241
## engines_assigned                                                  0.075380
## ladders_assigned                                                  0.026140
## others_units_assigned                                             0.009306
## day_typeWeekend                                                   0.013525
## time_of_dayMorning                                                0.009375
## time_of_dayAfternoon                                              0.009052
## time_of_dayEvening                                                0.009666
## tua_is_oneY                                                       0.014256
## log(ladders_assigned + 1)                                         0.050962
## log(engines_assigned + 1)                                         0.122205
## log(others_units_assigned + 1)                                    0.021401
## inc_class_groupMedical MFAs:engines_assigned                      0.270299
## inc_class_groupNonMedical Emergencies:engines_assigned            0.076922
## inc_class_groupNonMedical MFAs:engines_assigned                   0.090692
## inc_class_groupNonStructural Fires:engines_assigned               0.110614
## inc_class_groupStructural Fires:engines_assigned                  0.083385
## day_typeWeekend:time_of_dayMorning                                0.017182
## day_typeWeekend:time_of_dayAfternoon                              0.016465
## day_typeWeekend:time_of_dayEvening                                0.017479
## inc_class_groupMedical MFAs:log(engines_assigned + 1)             0.557364
## inc_class_groupNonMedical Emergencies:log(engines_assigned + 1)   0.128703
## inc_class_groupNonMedical MFAs:log(engines_assigned + 1)          0.171504
## inc_class_groupNonStructural Fires:log(engines_assigned + 1)      0.266366
## inc_class_groupStructural Fires:log(engines_assigned + 1)         0.168937
##                                                                 t value
## (Intercept)                                                      12.937
## inc_boroughBrooklyn                                             -26.114
## inc_boroughManhattan                                             -5.754
## inc_boroughQueens                                                -6.015
## inc_boroughStaten Island                                        -10.122
## al_source_descEMS                                                -1.632
## al_source_descEMS-911                                            -2.491
## al_source_descCLASS-3                                             3.955
## al_source_descOthers                                            -29.370
## al_index_descInitial Alarm                                       -2.885
## al_index_descOthers                                              -0.037
## highest_al_levelFirst Alarm                                       0.914
## highest_al_level2nd-3rd Alarm                                    -3.391
## inc_class_groupMedical MFAs                                      -0.132
## inc_class_groupNonMedical Emergencies                            14.799
## inc_class_groupNonMedical MFAs                                   13.854
## inc_class_groupNonStructural Fires                               -0.688
## inc_class_groupStructural Fires                                   6.259
## engines_assigned                                                 -2.593
## ladders_assigned                                                 11.149
## others_units_assigned                                             0.354
## day_typeWeekend                                                   0.175
## time_of_dayMorning                                               -5.780
## time_of_dayAfternoon                                             -9.007
## time_of_dayEvening                                              -13.463
## tua_is_oneY                                                      -2.396
## log(ladders_assigned + 1)                                       -13.333
## log(engines_assigned + 1)                                         3.435
## log(others_units_assigned + 1)                                   -2.308
## inc_class_groupMedical MFAs:engines_assigned                      0.458
## inc_class_groupNonMedical Emergencies:engines_assigned            4.428
## inc_class_groupNonMedical MFAs:engines_assigned                   5.772
## inc_class_groupNonStructural Fires:engines_assigned               1.460
## inc_class_groupStructural Fires:engines_assigned                  2.429
## day_typeWeekend:time_of_dayMorning                               -3.091
## day_typeWeekend:time_of_dayAfternoon                             -2.899
## day_typeWeekend:time_of_dayEvening                                0.766
## inc_class_groupMedical MFAs:log(engines_assigned + 1)            -0.284
## inc_class_groupNonMedical Emergencies:log(engines_assigned + 1)  -7.868
## inc_class_groupNonMedical MFAs:log(engines_assigned + 1)         -9.283
## inc_class_groupNonStructural Fires:log(engines_assigned + 1)     -1.077
## inc_class_groupStructural Fires:log(engines_assigned + 1)        -4.717
##                                                                 Pr(>|t|)    
## (Intercept)                                                      < 2e-16 ***
## inc_boroughBrooklyn                                              < 2e-16 ***
## inc_boroughManhattan                                            8.82e-09 ***
## inc_boroughQueens                                               1.83e-09 ***
## inc_boroughStaten Island                                         < 2e-16 ***
## al_source_descEMS                                               0.102593    
## al_source_descEMS-911                                           0.012751 *  
## al_source_descCLASS-3                                           7.67e-05 ***
## al_source_descOthers                                             < 2e-16 ***
## al_index_descInitial Alarm                                      0.003915 ** 
## al_index_descOthers                                             0.970160    
## highest_al_levelFirst Alarm                                     0.360871    
## highest_al_level2nd-3rd Alarm                                   0.000697 ***
## inc_class_groupMedical MFAs                                     0.894646    
## inc_class_groupNonMedical Emergencies                            < 2e-16 ***
## inc_class_groupNonMedical MFAs                                   < 2e-16 ***
## inc_class_groupNonStructural Fires                              0.491607    
## inc_class_groupStructural Fires                                 3.95e-10 ***
## engines_assigned                                                0.009529 ** 
## ladders_assigned                                                 < 2e-16 ***
## others_units_assigned                                           0.723635    
## day_typeWeekend                                                 0.860810    
## time_of_dayMorning                                              7.59e-09 ***
## time_of_dayAfternoon                                             < 2e-16 ***
## time_of_dayEvening                                               < 2e-16 ***
## tua_is_oneY                                                     0.016581 *  
## log(ladders_assigned + 1)                                        < 2e-16 ***
## log(engines_assigned + 1)                                       0.000594 ***
## log(others_units_assigned + 1)                                  0.021028 *  
## inc_class_groupMedical MFAs:engines_assigned                    0.646947    
## inc_class_groupNonMedical Emergencies:engines_assigned          9.54e-06 ***
## inc_class_groupNonMedical MFAs:engines_assigned                 7.96e-09 ***
## inc_class_groupNonStructural Fires:engines_assigned             0.144310    
## inc_class_groupStructural Fires:engines_assigned                0.015139 *  
## day_typeWeekend:time_of_dayMorning                              0.001999 ** 
## day_typeWeekend:time_of_dayAfternoon                            0.003753 ** 
## day_typeWeekend:time_of_dayEvening                              0.443572    
## inc_class_groupMedical MFAs:log(engines_assigned + 1)           0.776748    
## inc_class_groupNonMedical Emergencies:log(engines_assigned + 1) 3.75e-15 ***
## inc_class_groupNonMedical MFAs:log(engines_assigned + 1)         < 2e-16 ***
## inc_class_groupNonStructural Fires:log(engines_assigned + 1)    0.281410    
## inc_class_groupStructural Fires:log(engines_assigned + 1)       2.41e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3661 on 23001 degrees of freedom
## Multiple R-squared:  0.1825, Adjusted R-squared:  0.1811 
## F-statistic: 125.3 on 41 and 23001 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_irm_full_upd_2)

residualPlots(lm_irm_full_upd_2)

##                                Test stat Pr(>|Test stat|)    
## inc_borough                                                  
## al_source_desc                                               
## al_index_desc                                                
## highest_al_level                                             
## inc_class_group                                              
## engines_assigned                 -6.0990        1.084e-09 ***
## ladders_assigned                 -6.8003        1.070e-11 ***
## others_units_assigned            -4.5106        6.497e-06 ***
## day_type                                                     
## time_of_day                                                  
## tua_is_one                                                   
## log(ladders_assigned + 1)         8.6788        < 2.2e-16 ***
## log(engines_assigned + 1)         9.5963        < 2.2e-16 ***
## log(others_units_assigned + 1)    4.9674        6.833e-07 ***
## Tukey test                        2.5600          0.01047 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
influenceIndexPlot(lm_irm_full_upd_2, vars = "Cook")

Again, nothing seems to have changed in the Q-Q plot, whereas the residuals-versus-fitted plot shows a slightly more random spread. However, the residual variance is still not constant, and there is also an influential point, observation 9504.
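The non-constant variance read off the diagnostic plots can also be checked numerically. The following is a minimal sketch on simulated data (not the fire-department data), assuming a studentized Breusch-Pagan-style check computed by hand in base R. The names `fit`, `aux`, `bp_stat` and `bp_pval` are illustrative only; `car::ncvTest()` offers a related score test that could be applied to the fitted model directly.

```r
# Simulated data (hypothetical): the error spread grows with x.
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Studentized (Koenker) Breusch-Pagan check: regress the squared residuals
# on the fitted values and use n * R^2 as a chi-square(1) statistic.
aux <- lm(residuals(fit)^2 ~ fitted(fit))
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# A small p-value points to non-constant residual variance.
bp_pval
```

A small p-value here would only confirm what the scale-location plot suggests; it does not by itself say which predictor drives the changing variance.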

Let’s inspect the influential point and see what distinguishes it.

resp_min_fd.train[9504,]
##      inc_borough al_source_desc  al_index_desc highest_al_level inc_class_group
## 9504       Bronx        EMS-911 DEFAULT RECORD      First Alarm    Medical MFAs
##      inc_resp_min_qy engines_assigned ladders_assigned others_units_assigned
## 9504        4.516667                4                3                    15
##      day_type time_of_day tua_is_one
## 9504  Weekday     Evening          N

Now we investigate why observation 9504 is an influential point. Let’s look at the log-scaled counts of assigned units for Medical MFAs incidents and see where observation 9504 falls.

infl_point <- subset(resp_min_fd.train, inc_class_group == "Medical MFAs")

# Boxplot of log(engines assigned + 1); observation 9504 is marked in red
p1 <- ggplot(infl_point, aes(y = log(engines_assigned + 1))) +
  geom_boxplot() +
  geom_point(aes(x = 0, y = log(resp_min_fd.train[9504, "engines_assigned"] + 1)), col = "red", pch = 16) +
  labs(title = "Engines Units Count",
       x = "Engines Units", y = "log(Count + 1)")

# Boxplot of log(ladders assigned + 1); observation 9504 is marked in red
p2 <- ggplot(infl_point, aes(y = log(ladders_assigned + 1))) +
  geom_boxplot() +
  geom_point(aes(x = 0, y = log(resp_min_fd.train[9504, "ladders_assigned"] + 1)), col = "red", pch = 16) +
  labs(title = "Ladders Units Count",
       x = "Ladders Units", y = "log(Count + 1)")

# Boxplot of log(other units assigned + 1); observation 9504 is marked in red
p3 <- ggplot(infl_point, aes(y = log(others_units_assigned + 1))) +
  geom_boxplot() +
  geom_point(aes(x = 0, y = log(resp_min_fd.train[9504, "others_units_assigned"] + 1)), col = "red", pch = 16) +
  labs(title = "Other Units Count",
       x = "Other Units", y = "log(Count + 1)")

# Display the plots in a 1x3 grid
grid.arrange(p1, p2, p3, ncol = 3)

We see that this incident lies far from the distribution of other assigned units for Medical MFAs incidents, so we remove the observation and refit the last model.
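Before dropping a row, it can help to quantify how much the coefficients actually move without it. The sketch below uses simulated data with one deliberately aberrant row; the data frame `d` and the objects `fit_all`, `fit_drop`, `worst` and `rel_change` are hypothetical names, not part of the analysis above.

```r
# Simulated data (hypothetical): 50 regular rows plus one aberrant row placed last.
set.seed(2)
d <- data.frame(x = c(rnorm(50), 8))
d$y <- 1 + 2 * d$x + rnorm(51, sd = 0.5)
d$y[51] <- -10

fit_all <- lm(y ~ x, data = d)

# Position of the row with the largest Cook's distance.
worst <- unname(which.max(cooks.distance(fit_all)))

# Refit without that row; a negative subset index drops rows by position.
fit_drop <- update(fit_all, subset = -worst)

# Relative change of each coefficient after removing the row.
rel_change <- (coef(fit_drop) - coef(fit_all)) / abs(coef(fit_all))
round(rel_change, 3)
```

Note that `subset = -worst` removes a row by position, whereas diagnostic plots usually label points by row name. The two agree only when the row names are the default sequence, which appears to be the case here since row 9504 prints with the name 9504.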

lm_irm_full_upd_3 <- update(lm_irm_full_upd_2, subset = -9504)
summary(lm_irm_full_upd_3)
## 
## Call:
## lm(formula = log(inc_resp_min_qy) ~ inc_borough + al_source_desc + 
##     al_index_desc + highest_al_level + inc_class_group + engines_assigned + 
##     ladders_assigned + others_units_assigned + day_type + time_of_day + 
##     tua_is_one + log(ladders_assigned + 1) + log(engines_assigned + 
##     1) + log(others_units_assigned + 1) + inc_class_group:engines_assigned + 
##     day_type:time_of_day + inc_class_group:log(engines_assigned + 
##     1), data = resp_min_fd.train, subset = -9504)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.67539 -0.19761 -0.00197  0.19814  2.55387 
## 
## Coefficients:
##                                                                  Estimate
## (Intercept)                                                      1.764847
## inc_boroughBrooklyn                                             -0.179775
## inc_boroughManhattan                                            -0.041868
## inc_boroughQueens                                               -0.045725
## inc_boroughStaten Island                                        -0.122401
## al_source_descEMS                                               -0.027197
## al_source_descEMS-911                                           -0.041425
## al_source_descCLASS-3                                            0.032895
## al_source_descOthers                                            -0.499213
## al_index_descInitial Alarm                                      -0.030541
## al_index_descOthers                                             -0.010355
## highest_al_levelFirst Alarm                                      0.118423
## highest_al_level2nd-3rd Alarm                                   -0.931039
## inc_class_groupMedical MFAs                                      0.012071
## inc_class_groupNonMedical Emergencies                            0.426243
## inc_class_groupNonMedical MFAs                                   0.574323
## inc_class_groupNonStructural Fires                              -0.065413
## inc_class_groupStructural Fires                                  0.381807
## engines_assigned                                                -0.195615
## ladders_assigned                                                 0.290993
## others_units_assigned                                            0.004452
## day_typeWeekend                                                  0.002383
## time_of_dayMorning                                              -0.054215
## time_of_dayAfternoon                                            -0.081532
## time_of_dayEvening                                              -0.130086
## tua_is_oneY                                                     -0.034017
## log(ladders_assigned + 1)                                       -0.679023
## log(engines_assigned + 1)                                        0.419479
## log(others_units_assigned + 1)                                  -0.051644
## inc_class_groupMedical MFAs:engines_assigned                     1.770697
## inc_class_groupNonMedical Emergencies:engines_assigned           0.340976
## inc_class_groupNonMedical MFAs:engines_assigned                  0.524001
## inc_class_groupNonStructural Fires:engines_assigned              0.162001
## inc_class_groupStructural Fires:engines_assigned                 0.201971
## day_typeWeekend:time_of_dayMorning                              -0.053080
## day_typeWeekend:time_of_dayAfternoon                            -0.048092
## day_typeWeekend:time_of_dayEvening                               0.013318
## inc_class_groupMedical MFAs:log(engines_assigned + 1)           -2.591759
## inc_class_groupNonMedical Emergencies:log(engines_assigned + 1) -1.012281
## inc_class_groupNonMedical MFAs:log(engines_assigned + 1)        -1.592082
## inc_class_groupNonStructural Fires:log(engines_assigned + 1)    -0.286766
## inc_class_groupStructural Fires:log(engines_assigned + 1)       -0.793631
##                                                                 Std. Error
## (Intercept)                                                       0.136218
## inc_boroughBrooklyn                                               0.006888
## inc_boroughManhattan                                              0.007297
## inc_boroughQueens                                                 0.007616
## inc_boroughStaten Island                                          0.012096
## al_source_descEMS                                                 0.016447
## al_source_descEMS-911                                             0.016576
## al_source_descCLASS-3                                             0.008295
## al_source_descOthers                                              0.016992
## al_index_descInitial Alarm                                        0.010596
## al_index_descOthers                                               0.137295
## highest_al_levelFirst Alarm                                       0.132030
## highest_al_level2nd-3rd Alarm                                     0.272636
## inc_class_groupMedical MFAs                                       0.165264
## inc_class_groupNonMedical Emergencies                             0.028837
## inc_class_groupNonMedical MFAs                                    0.041490
## inc_class_groupNonStructural Fires                                0.094039
## inc_class_groupStructural Fires                                   0.061243
## engines_assigned                                                  0.075376
## ladders_assigned                                                  0.026140
## others_units_assigned                                             0.009326
## day_typeWeekend                                                   0.013525
## time_of_dayMorning                                                0.009375
## time_of_dayAfternoon                                              0.009052
## time_of_dayEvening                                                0.009666
## tua_is_oneY                                                       0.014255
## log(ladders_assigned + 1)                                         0.050960
## log(engines_assigned + 1)                                         0.122198
## log(others_units_assigned + 1)                                    0.021434
## inc_class_groupMedical MFAs:engines_assigned                      0.925465
## inc_class_groupNonMedical Emergencies:engines_assigned            0.076918
## inc_class_groupNonMedical MFAs:engines_assigned                   0.090688
## inc_class_groupNonStructural Fires:engines_assigned               0.110608
## inc_class_groupStructural Fires:engines_assigned                  0.083381
## day_typeWeekend:time_of_dayMorning                                0.017181
## day_typeWeekend:time_of_dayAfternoon                              0.016466
## day_typeWeekend:time_of_dayEvening                                0.017478
## inc_class_groupMedical MFAs:log(engines_assigned + 1)             1.421777
## inc_class_groupNonMedical Emergencies:log(engines_assigned + 1)   0.128696
## inc_class_groupNonMedical MFAs:log(engines_assigned + 1)          0.171494
## inc_class_groupNonStructural Fires:log(engines_assigned + 1)      0.266352
## inc_class_groupStructural Fires:log(engines_assigned + 1)         0.168937
##                                                                 t value
## (Intercept)                                                      12.956
## inc_boroughBrooklyn                                             -26.101
## inc_boroughManhattan                                             -5.738
## inc_boroughQueens                                                -6.004
## inc_boroughStaten Island                                        -10.119
## al_source_descEMS                                                -1.654
## al_source_descEMS-911                                            -2.499
## al_source_descCLASS-3                                             3.965
## al_source_descOthers                                            -29.379
## al_index_descInitial Alarm                                       -2.882
## al_index_descOthers                                              -0.075
## highest_al_levelFirst Alarm                                       0.897
## highest_al_level2nd-3rd Alarm                                    -3.415
## inc_class_groupMedical MFAs                                       0.073
## inc_class_groupNonMedical Emergencies                            14.781
## inc_class_groupNonMedical MFAs                                   13.842
## inc_class_groupNonStructural Fires                               -0.696
## inc_class_groupStructural Fires                                   6.234
## engines_assigned                                                 -2.595
## ladders_assigned                                                 11.132
## others_units_assigned                                             0.477
## day_typeWeekend                                                   0.176
## time_of_dayMorning                                               -5.783
## time_of_dayAfternoon                                             -9.007
## time_of_dayEvening                                              -13.459
## tua_is_oneY                                                      -2.386
## log(ladders_assigned + 1)                                       -13.325
## log(engines_assigned + 1)                                         3.433
## log(others_units_assigned + 1)                                   -2.409
## inc_class_groupMedical MFAs:engines_assigned                      1.913
## inc_class_groupNonMedical Emergencies:engines_assigned            4.433
## inc_class_groupNonMedical MFAs:engines_assigned                   5.778
## inc_class_groupNonStructural Fires:engines_assigned               1.465
## inc_class_groupStructural Fires:engines_assigned                  2.422
## day_typeWeekend:time_of_dayMorning                               -3.090
## day_typeWeekend:time_of_dayAfternoon                             -2.921
## day_typeWeekend:time_of_dayEvening                                0.762
## inc_class_groupMedical MFAs:log(engines_assigned + 1)            -1.823
## inc_class_groupNonMedical Emergencies:log(engines_assigned + 1)  -7.866
## inc_class_groupNonMedical MFAs:log(engines_assigned + 1)         -9.284
## inc_class_groupNonStructural Fires:log(engines_assigned + 1)     -1.077
## inc_class_groupStructural Fires:log(engines_assigned + 1)        -4.698
##                                                                 Pr(>|t|)    
## (Intercept)                                                      < 2e-16 ***
## inc_boroughBrooklyn                                              < 2e-16 ***
## inc_boroughManhattan                                            9.71e-09 ***
## inc_boroughQueens                                               1.95e-09 ***
## inc_boroughStaten Island                                         < 2e-16 ***
## al_source_descEMS                                               0.098212 .  
## al_source_descEMS-911                                           0.012455 *  
## al_source_descCLASS-3                                           7.35e-05 ***
## al_source_descOthers                                             < 2e-16 ***
## al_index_descInitial Alarm                                      0.003953 ** 
## al_index_descOthers                                             0.939882    
## highest_al_levelFirst Alarm                                     0.369762    
## highest_al_level2nd-3rd Alarm                                   0.000639 ***
## inc_class_groupMedical MFAs                                     0.941776    
## inc_class_groupNonMedical Emergencies                            < 2e-16 ***
## inc_class_groupNonMedical MFAs                                   < 2e-16 ***
## inc_class_groupNonStructural Fires                              0.486694    
## inc_class_groupStructural Fires                                 4.62e-10 ***
## engines_assigned                                                0.009460 ** 
## ladders_assigned                                                 < 2e-16 ***
## others_units_assigned                                           0.633126    
## day_typeWeekend                                                 0.860151    
## time_of_dayMorning                                              7.42e-09 ***
## time_of_dayAfternoon                                             < 2e-16 ***
## time_of_dayEvening                                               < 2e-16 ***
## tua_is_oneY                                                     0.017025 *  
## log(ladders_assigned + 1)                                        < 2e-16 ***
## log(engines_assigned + 1)                                       0.000598 ***
## log(others_units_assigned + 1)                                  0.015986 *  
## inc_class_groupMedical MFAs:engines_assigned                    0.055721 .  
## inc_class_groupNonMedical Emergencies:engines_assigned          9.34e-06 ***
## inc_class_groupNonMedical MFAs:engines_assigned                 7.65e-09 ***
## inc_class_groupNonStructural Fires:engines_assigned             0.143033    
## inc_class_groupStructural Fires:engines_assigned                0.015432 *  
## day_typeWeekend:time_of_dayMorning                              0.002007 ** 
## day_typeWeekend:time_of_dayAfternoon                            0.003495 ** 
## day_typeWeekend:time_of_dayEvening                              0.446071    
## inc_class_groupMedical MFAs:log(engines_assigned + 1)           0.068331 .  
## inc_class_groupNonMedical Emergencies:log(engines_assigned + 1) 3.83e-15 ***
## inc_class_groupNonMedical MFAs:log(engines_assigned + 1)         < 2e-16 ***
## inc_class_groupNonStructural Fires:log(engines_assigned + 1)    0.281650    
## inc_class_groupStructural Fires:log(engines_assigned + 1)       2.65e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.366 on 23000 degrees of freedom
## Multiple R-squared:  0.1826, Adjusted R-squared:  0.1812 
## F-statistic: 125.4 on 41 and 23000 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_irm_full_upd_3)
## Warning: not plotting observations with leverage one:
##    3682

Again, the situation is the same as before.

residualPlots(lm_irm_full_upd_3)

##                                Test stat Pr(>|Test stat|)    
## inc_borough                                                  
## al_source_desc                                               
## al_index_desc                                                
## highest_al_level                                             
## inc_class_group                                              
## engines_assigned                 -6.1096        1.015e-09 ***
## ladders_assigned                 -6.8157        9.612e-12 ***
## others_units_assigned            -4.4780        7.572e-06 ***
## day_type                                                     
## time_of_day                                                  
## tua_is_one                                                   
## log(ladders_assigned + 1)         8.6987        < 2.2e-16 ***
## log(engines_assigned + 1)         9.6015        < 2.2e-16 ***
## log(others_units_assigned + 1)    4.9321        8.194e-07 ***
## Tukey test                        2.5757             0.01 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqPlot(residuals(lm_irm_full_upd_3))

## 11439  8399 
## 11438  8399
influenceIndexPlot(lm_irm_full_upd_3, vars = "Cook")

summary(fitted(lm_irm_full_upd_3))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6401  1.5997  1.7104  1.7008  1.8040  2.3946

We again try removing the most influential point, observation 16373, even though it does not lie outside the Cook’s distance band.
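Whether a point lies inside or outside the Cook's distance contours can also be checked directly on the numbers. The sketch below uses simulated data and assumes the common cut-offs of 0.5 and 1 (the contours typically drawn in the residuals-versus-leverage plot) together with the 4/n rule of thumb. The names `fit`, `cd`, `top` and `flags` are illustrative only.

```r
# Simulated data (hypothetical): one high-leverage point that is poorly fitted.
set.seed(3)
n <- 200
x <- rnorm(n)
y <- 1 + x + rnorm(n)
x[n] <- 6
y[n] <- 3
fit <- lm(y ~ x)

# Cook's distance for every observation.
cd <- cooks.distance(fit)

# The three largest values, labelled by observation.
top <- head(sort(cd, decreasing = TRUE), 3)

# Number of observations above each rule-of-thumb cut-off.
flags <- c(above_1 = sum(cd > 1),
           above_0.5 = sum(cd > 0.5),
           above_4_over_n = sum(cd > 4 / n))
top
flags
```

A point that exceeds only the 4/n cut-off, as observation 16373 seems to, is a weaker candidate for removal than one beyond the 0.5 or 1 contours, so comparing the refitted coefficients remains the more informative check.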

resp_min_fd.train[16373,]
##       inc_borough al_source_desc al_index_desc highest_al_level
## 16373    Brooklyn          PHONE        Others    2nd-3rd Alarm
##        inc_class_group inc_resp_min_qy engines_assigned ladders_assigned
## 16373 Structural Fires        2.916667               18               13
##       others_units_assigned day_type time_of_day tua_is_one
## 16373                    25  Weekend       Night          N
infl_point <- subset(resp_min_fd.train, inc_class_group == "Structural Fires")

# Boxplot of log(engines assigned + 1); observation 16373 is marked in red
p1 <- ggplot(infl_point, aes(y = log(engines_assigned + 1))) +
  geom_boxplot() +
  geom_point(aes(x = 0, y = log(resp_min_fd.train[16373, "engines_assigned"] + 1)), col = "red", pch = 16) +
  labs(title = "Engines Units Count",
       x = "Engines Units", y = "log(Count + 1)")

# Boxplot of log(ladders assigned + 1); observation 16373 is marked in red
p2 <- ggplot(infl_point, aes(y = log(ladders_assigned + 1))) +
  geom_boxplot() +
  geom_point(aes(x = 0, y = log(resp_min_fd.train[16373, "ladders_assigned"] + 1)), col = "red", pch = 16) +
  labs(title = "Ladders Units Count",
       x = "Ladders Units", y = "log(Count + 1)")

# Boxplot of log(other units assigned + 1); observation 16373 is marked in red
p3 <- ggplot(infl_point, aes(y = log(others_units_assigned + 1))) +
  geom_boxplot() +
  geom_point(aes(x = 0, y = log(resp_min_fd.train[16373, "others_units_assigned"] + 1)), col = "red", pch = 16) +
  labs(title = "Other Units Count",
       x = "Other Units", y = "log(Count + 1)")

# Display the plots in a 1x3 grid
grid.arrange(p1, p2, p3, ncol = 3)

lm_irm_full_upd_4 <- update(lm_irm_full_upd_2, subset = -c(16373, 9504))
summary(lm_irm_full_upd_4)
## 
## Call:
## lm(formula = log(inc_resp_min_qy) ~ inc_borough + al_source_desc + 
##     al_index_desc + highest_al_level + inc_class_group + engines_assigned + 
##     ladders_assigned + others_units_assigned + day_type + time_of_day + 
##     tua_is_one + log(ladders_assigned + 1) + log(engines_assigned + 
##     1) + log(others_units_assigned + 1) + inc_class_group:engines_assigned + 
##     day_type:time_of_day + inc_class_group:log(engines_assigned + 
##     1), data = resp_min_fd.train, subset = -c(16373, 9504))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.67877 -0.19727 -0.00203  0.19841  2.55264 
## 
## Coefficients:
##                                                                  Estimate
## (Intercept)                                                      1.793425
## inc_boroughBrooklyn                                             -0.179560
## inc_boroughManhattan                                            -0.041922
## inc_boroughQueens                                               -0.045695
## inc_boroughStaten Island                                        -0.122824
## al_source_descEMS                                               -0.027649
## al_source_descEMS-911                                           -0.041880
## al_source_descCLASS-3                                            0.033267
## al_source_descOthers                                            -0.499566
## al_index_descInitial Alarm                                      -0.030482
## al_index_descOthers                                             -0.074398
## highest_al_levelFirst Alarm                                      0.093107
## highest_al_level2nd-3rd Alarm                                   -0.976258
## inc_class_groupMedical MFAs                                      0.012376
## inc_class_groupNonMedical Emergencies                            0.426807
## inc_class_groupNonMedical MFAs                                   0.574842
## inc_class_groupNonStructural Fires                              -0.065866
## inc_class_groupStructural Fires                                  0.434765
## engines_assigned                                                -0.196669
## ladders_assigned                                                 0.298768
## others_units_assigned                                            0.005543
## day_typeWeekend                                                  0.003642
## time_of_dayMorning                                              -0.054005
## time_of_dayAfternoon                                            -0.081320
## time_of_dayEvening                                              -0.129651
## tua_is_oneY                                                     -0.035994
## log(ladders_assigned + 1)                                       -0.693855
## log(engines_assigned + 1)                                        0.419252
## log(others_units_assigned + 1)                                  -0.054969
## inc_class_groupMedical MFAs:engines_assigned                     1.773750
## inc_class_groupNonMedical Emergencies:engines_assigned           0.342424
## inc_class_groupNonMedical MFAs:engines_assigned                  0.526156
## inc_class_groupNonStructural Fires:engines_assigned              0.163860
## inc_class_groupStructural Fires:engines_assigned                 0.253909
## day_typeWeekend:time_of_dayMorning                              -0.054177
## day_typeWeekend:time_of_dayAfternoon                            -0.049276
## day_typeWeekend:time_of_dayEvening                               0.011822
## inc_class_groupMedical MFAs:log(engines_assigned + 1)           -2.596486
## inc_class_groupNonMedical Emergencies:log(engines_assigned + 1) -1.014810
## inc_class_groupNonMedical MFAs:log(engines_assigned + 1)        -1.596092
## inc_class_groupNonStructural Fires:log(engines_assigned + 1)    -0.289099
## inc_class_groupStructural Fires:log(engines_assigned + 1)       -0.946323
##                                                                 Std. Error
## (Intercept)                                                       0.136505
## inc_boroughBrooklyn                                               0.006887
## inc_boroughManhattan                                              0.007296
## inc_boroughQueens                                                 0.007614
## inc_boroughStaten Island                                          0.012095
## al_source_descEMS                                                 0.016444
## al_source_descEMS-911                                             0.016573
## al_source_descCLASS-3                                             0.008295
## al_source_descOthers                                              0.016989
## al_index_descInitial Alarm                                        0.010594
## al_index_descOthers                                               0.138817
## highest_al_levelFirst Alarm                                       0.132258
## highest_al_level2nd-3rd Alarm                                     0.272975
## inc_class_groupMedical MFAs                                       0.165233
## inc_class_groupNonMedical Emergencies                             0.028832
## inc_class_groupNonMedical MFAs                                    0.041483
## inc_class_groupNonStructural Fires                                0.094022
## inc_class_groupStructural Fires                                   0.063572
## engines_assigned                                                  0.075363
## ladders_assigned                                                  0.026255
## others_units_assigned                                             0.009331
## day_typeWeekend                                                   0.013528
## time_of_dayMorning                                                0.009373
## time_of_dayAfternoon                                              0.009050
## time_of_dayEvening                                                0.009665
## tua_is_oneY                                                       0.014267
## log(ladders_assigned + 1)                                         0.051175
## log(engines_assigned + 1)                                         0.122175
## log(others_units_assigned + 1)                                    0.021457
## inc_class_groupMedical MFAs:engines_assigned                      0.925293
## inc_class_groupNonMedical Emergencies:engines_assigned            0.076905
## inc_class_groupNonMedical MFAs:engines_assigned                   0.090673
## inc_class_groupNonStructural Fires:engines_assigned               0.110589
## inc_class_groupStructural Fires:engines_assigned                  0.085034
## day_typeWeekend:time_of_dayMorning                                0.017181
## day_typeWeekend:time_of_dayAfternoon                              0.016467
## day_typeWeekend:time_of_dayEvening                                0.017482
## inc_class_groupMedical MFAs:log(engines_assigned + 1)             1.421512
## inc_class_groupNonMedical Emergencies:log(engines_assigned + 1)   0.128675
## inc_class_groupNonMedical MFAs:log(engines_assigned + 1)          0.171467
## inc_class_groupNonStructural Fires:log(engines_assigned + 1)      0.266303
## inc_class_groupStructural Fires:log(engines_assigned + 1)         0.175949
##                                                                 t value
## (Intercept)                                                      13.138
## inc_boroughBrooklyn                                             -26.073
## inc_boroughManhattan                                             -5.746
## inc_boroughQueens                                                -6.001
## inc_boroughStaten Island                                        -10.155
## al_source_descEMS                                                -1.681
## al_source_descEMS-911                                            -2.527
## al_source_descCLASS-3                                             4.011
## al_source_descOthers                                            -29.404
## al_index_descInitial Alarm                                       -2.877
## al_index_descOthers                                              -0.536
## highest_al_levelFirst Alarm                                       0.704
## highest_al_level2nd-3rd Alarm                                    -3.576
## inc_class_groupMedical MFAs                                       0.075
## inc_class_groupNonMedical Emergencies                            14.803
## inc_class_groupNonMedical MFAs                                   13.857
## inc_class_groupNonStructural Fires                               -0.701
## inc_class_groupStructural Fires                                   6.839
## engines_assigned                                                 -2.610
## ladders_assigned                                                 11.379
## others_units_assigned                                             0.594
## day_typeWeekend                                                   0.269
## time_of_dayMorning                                               -5.762
## time_of_dayAfternoon                                             -8.985
## time_of_dayEvening                                              -13.415
## tua_is_oneY                                                      -2.523
## log(ladders_assigned + 1)                                       -13.559
## log(engines_assigned + 1)                                         3.432
## log(others_units_assigned + 1)                                   -2.562
## inc_class_groupMedical MFAs:engines_assigned                      1.917
## inc_class_groupNonMedical Emergencies:engines_assigned            4.453
## inc_class_groupNonMedical MFAs:engines_assigned                   5.803
## inc_class_groupNonStructural Fires:engines_assigned               1.482
## inc_class_groupStructural Fires:engines_assigned                  2.986
## day_typeWeekend:time_of_dayMorning                               -3.153
## day_typeWeekend:time_of_dayAfternoon                             -2.992
## day_typeWeekend:time_of_dayEvening                                0.676
## inc_class_groupMedical MFAs:log(engines_assigned + 1)            -1.827
## inc_class_groupNonMedical Emergencies:log(engines_assigned + 1)  -7.887
## inc_class_groupNonMedical MFAs:log(engines_assigned + 1)         -9.308
## inc_class_groupNonStructural Fires:log(engines_assigned + 1)     -1.086
## inc_class_groupStructural Fires:log(engines_assigned + 1)        -5.378
##                                                                 Pr(>|t|)    
## (Intercept)                                                      < 2e-16 ***
## inc_boroughBrooklyn                                              < 2e-16 ***
## inc_boroughManhattan                                            9.25e-09 ***
## inc_boroughQueens                                               1.99e-09 ***
## inc_boroughStaten Island                                         < 2e-16 ***
## al_source_descEMS                                               0.092705 .  
## al_source_descEMS-911                                           0.011513 *  
## al_source_descCLASS-3                                           6.07e-05 ***
## al_source_descOthers                                             < 2e-16 ***
## al_index_descInitial Alarm                                      0.004016 ** 
## al_index_descOthers                                             0.592004    
## highest_al_levelFirst Alarm                                     0.481453    
## highest_al_level2nd-3rd Alarm                                   0.000349 ***
## inc_class_groupMedical MFAs                                     0.940295    
## inc_class_groupNonMedical Emergencies                            < 2e-16 ***
## inc_class_groupNonMedical MFAs                                   < 2e-16 ***
## inc_class_groupNonStructural Fires                              0.483595    
## inc_class_groupStructural Fires                                 8.18e-12 ***
## engines_assigned                                                0.009070 ** 
## ladders_assigned                                                 < 2e-16 ***
## others_units_assigned                                           0.552494    
## day_typeWeekend                                                 0.787750    
## time_of_dayMorning                                              8.44e-09 ***
## time_of_dayAfternoon                                             < 2e-16 ***
## time_of_dayEvening                                               < 2e-16 ***
## tua_is_oneY                                                     0.011643 *  
## log(ladders_assigned + 1)                                        < 2e-16 ***
## log(engines_assigned + 1)                                       0.000601 ***
## log(others_units_assigned + 1)                                  0.010419 *  
## inc_class_groupMedical MFAs:engines_assigned                    0.055255 .  
## inc_class_groupNonMedical Emergencies:engines_assigned          8.52e-06 ***
## inc_class_groupNonMedical MFAs:engines_assigned                 6.61e-09 ***
## inc_class_groupNonStructural Fires:engines_assigned             0.138434    
## inc_class_groupStructural Fires:engines_assigned                0.002830 ** 
## day_typeWeekend:time_of_dayMorning                              0.001617 ** 
## day_typeWeekend:time_of_dayAfternoon                            0.002771 ** 
## day_typeWeekend:time_of_dayEvening                              0.498878    
## inc_class_groupMedical MFAs:log(engines_assigned + 1)           0.067778 .  
## inc_class_groupNonMedical Emergencies:log(engines_assigned + 1) 3.24e-15 ***
## inc_class_groupNonMedical MFAs:log(engines_assigned + 1)         < 2e-16 ***
## inc_class_groupNonStructural Fires:log(engines_assigned + 1)    0.277667    
## inc_class_groupStructural Fires:log(engines_assigned + 1)       7.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.366 on 22999 degrees of freedom
## Multiple R-squared:  0.1829, Adjusted R-squared:  0.1814 
## F-statistic: 125.6 on 41 and 22999 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_irm_full_upd_4)
## Warning: not plotting observations with leverage one:
##    3682

residualPlots(lm_irm_full_upd_4)

##                                Test stat Pr(>|Test stat|)    
## inc_borough                                                  
## al_source_desc                                               
## al_index_desc                                                
## highest_al_level                                             
## inc_class_group                                              
## engines_assigned                 -6.2667        3.753e-10 ***
## ladders_assigned                 -6.1606        7.368e-10 ***
## others_units_assigned            -3.4382        0.0005866 ***
## day_type                                                     
## time_of_day                                                  
## tua_is_one                                                   
## log(ladders_assigned + 1)         8.1271        4.618e-16 ***
## log(engines_assigned + 1)         9.0977        < 2.2e-16 ***
## log(others_units_assigned + 1)    4.4294        9.494e-06 ***
## Tukey test                        2.1411        0.0322669 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqPlot(residuals(lm_irm_full_upd_4))

## 11439  8399 
## 11438  8399
influenceIndexPlot(lm_irm_full_upd_4, vars = "Cook")

summary(fitted(lm_irm_full_upd_4))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6749  1.5995  1.7104  1.7008  1.8038  2.3965

However, the residuals are still not normally distributed, so we can't rely on a linear regression model for the log of inc_resp_min_qy. Let's see whether the linear model assumptions are met for the other response (spoiler: they are not, again :( ).

4.1.2 Use emergency_min_qy as response

Again we make a copy of the train and test sets. This time we merge the assigned units into a single predictor, total_assigned_units, and drop the individual counts, to see whether the qqPlots of the next models improve.

# make a copy of the train and test
emerg_min_fd.train <- fire_data.train
emerg_min_fd.test <- fire_data.test

emerg_min_fd.train$total_assigned_units = emerg_min_fd.train$engines_assigned + emerg_min_fd.train$ladders_assigned + emerg_min_fd.train$others_units_assigned
emerg_min_fd.test$total_assigned_units = emerg_min_fd.test$engines_assigned + emerg_min_fd.test$ladders_assigned + emerg_min_fd.test$others_units_assigned

# remove the future time differences and units counts
emerg_min_fd.train <- emerg_min_fd.train %>% 
  select(-c(inc_resp_min_qy, ticket_time, engines_assigned, ladders_assigned, others_units_assigned))
emerg_min_fd.test <- emerg_min_fd.test %>% 
  select(-c(inc_resp_min_qy, ticket_time, engines_assigned, ladders_assigned, others_units_assigned))

Fit a linear regression model with all the predictors.

lm_em_full <- lm(emergency_min_qy ~ ., data = emerg_min_fd.train)
summary(lm_em_full)
## 
## Call:
## lm(formula = emergency_min_qy ~ ., data = emerg_min_fd.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -159.34   -8.44   -3.20    3.68  447.06 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            36.00960    5.65956   6.363 2.02e-10 ***
## inc_boroughBrooklyn                    -7.28090    0.29904 -24.348  < 2e-16 ***
## inc_boroughManhattan                   -1.01384    0.31406  -3.228 0.001247 ** 
## inc_boroughQueens                      -4.88731    0.32744 -14.926  < 2e-16 ***
## inc_boroughStaten Island               -7.30118    0.52112 -14.011  < 2e-16 ***
## al_source_descEMS                       1.65247    0.68751   2.404 0.016244 *  
## al_source_descEMS-911                   2.70914    0.69596   3.893 9.94e-05 ***
## al_source_descCLASS-3                  -3.84839    0.34121 -11.279  < 2e-16 ***
## al_source_descOthers                    3.70639    0.70369   5.267 1.40e-07 ***
## al_index_descInitial Alarm             16.42672    0.45649  35.985  < 2e-16 ***
## al_index_descOthers                    96.50396    5.42673  17.783  < 2e-16 ***
## highest_al_levelFirst Alarm           -31.00041    5.59102  -5.545 2.98e-08 ***
## highest_al_level2nd-3rd Alarm         230.75863    7.63898  30.208  < 2e-16 ***
## inc_class_groupMedical MFAs            -0.77395    1.54247  -0.502 0.615842    
## inc_class_groupNonMedical Emergencies  -7.09947    0.66585 -10.662  < 2e-16 ***
## inc_class_groupNonMedical MFAs         -2.00122    0.98708  -2.027 0.042632 *  
## inc_class_groupNonStructural Fires      2.65601    1.03206   2.574 0.010073 *  
## inc_class_groupStructural Fires        -5.79816    0.85777  -6.760 1.42e-11 ***
## disp_resp_min_qy                        0.13220    0.23913   0.553 0.580382    
## inc_travel_min_qy                       0.16048    0.04166   3.852 0.000117 ***
## day_typeWeekend                        -0.25255    0.23424  -1.078 0.280964    
## time_of_dayMorning                      1.33597    0.33836   3.948 7.89e-05 ***
## time_of_dayAfternoon                    0.56196    0.32589   1.724 0.084647 .  
## time_of_dayEvening                     -0.37739    0.34802  -1.084 0.278210    
## tua_is_oneY                            -0.30932    0.39444  -0.784 0.432934    
## total_assigned_units                    1.38403    0.08355  16.565  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.78 on 23017 degrees of freedom
## Multiple R-squared:  0.3417, Adjusted R-squared:  0.3409 
## F-statistic: 477.8 on 25 and 23017 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_em_full)

Here the residuals vs fitted values follow a roughly linear trend, but they are not randomly spread: we can see two or three clusters. The same can be said of the Scale-Location plot, which shows that the variance is not constant. Again the residuals are not normally distributed, as the following plot shows more clearly.

qqPlot(residuals(lm_em_full))

## [1] 17313 15855
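
The non-constant variance read off the Scale-Location plot can also be checked numerically. The sketch below is not part of the original analysis: it uses simulated data (the names x, y and fit are hypothetical) and a hand-rolled version of Koenker's studentized Breusch-Pagan statistic in base R, only to show the idea.

```r
# Simulated data (hypothetical): the error spread grows with x,
# so the constant-variance assumption is violated by construction.
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Koenker's studentized Breusch-Pagan statistic:
# n * R^2 from regressing the squared residuals on the fitted values.
e2 <- resid(fit)^2
fv <- fitted(fit)
aux <- lm(e2 ~ fv)
bp_stat <- n * summary(aux)$r.squared

# Under constant variance the statistic is approximately chi-square with 1 df.
bp_pvalue <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_pvalue)
```

A small p-value points to heteroscedasticity. The same steps could be applied to lm_em_full, assuming its residuals and fitted values are used in place of the simulated ones.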

Let’s have a look at a possible power transformation of the response.

powerTransform(lm_em_full)
## Estimated transformation parameter 
##        Y1 
## 0.1454326
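
An estimate this close to 0 is worth comparing with the logarithm, because the Box-Cox family reduces to log(y) at lambda = 0. As a minimal sketch on simulated data (the variables x, y and fit are hypothetical), the block below uses MASS::boxcox instead of car::powerTransform to profile the likelihood over a grid of lambda values, which shows where such an estimate comes from.

```r
library(MASS)  # boxcox()

set.seed(123)
n <- 2000
x <- rnorm(n)
# Hypothetical positive, right-skewed response whose log is linear in x
y <- exp(1 + 0.8 * x + rnorm(n, sd = 0.4))
fit <- lm(y ~ x)

# Profile log-likelihood of the Box-Cox parameter over a grid of lambda values
bc <- boxcox(fit, lambda = seq(-1, 1, by = 0.01), plotit = FALSE)

# The grid value with the highest log-likelihood is the estimate
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

Because the simulated response is log-linear by construction, the estimate should land near 0, which is the case where the log transformation is the natural choice.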

We will try transforming the response with both the logarithm and the power 0.14, to see whether the distribution of the residuals improves.

First, the suggested power transformation of 0.14.

lm_em_full_014 <- update(lm_em_full, I(emergency_min_qy ^ 0.14) ~ .)
summary(lm_em_full_014)
## 
## Call:
## lm(formula = I(emergency_min_qy^0.14) ~ inc_borough + al_source_desc + 
##     al_index_desc + highest_al_level + inc_class_group + disp_resp_min_qy + 
##     inc_travel_min_qy + day_type + time_of_day + tua_is_one + 
##     total_assigned_units, data = emerg_min_fd.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77825 -0.09195 -0.00581  0.09105  0.71297 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            1.2851931  0.0551223  23.315  < 2e-16
## inc_boroughBrooklyn                   -0.0776249  0.0029125 -26.652  < 2e-16
## inc_boroughManhattan                  -0.0089630  0.0030588  -2.930 0.003390
## inc_boroughQueens                     -0.0535928  0.0031891 -16.805  < 2e-16
## inc_boroughStaten Island              -0.0888417  0.0050755 -17.504  < 2e-16
## al_source_descEMS                      0.0135942  0.0066961   2.030 0.042352
## al_source_descEMS-911                  0.0185578  0.0067784   2.738 0.006191
## al_source_descCLASS-3                 -0.0524990  0.0033233 -15.797  < 2e-16
## al_source_descOthers                   0.0138225  0.0068537   2.017 0.043728
## al_index_descInitial Alarm             0.2985189  0.0044461  67.142  < 2e-16
## al_index_descOthers                    0.6474838  0.0528546  12.250  < 2e-16
## highest_al_levelFirst Alarm           -0.0682250  0.0544547  -1.253 0.210263
## highest_al_level2nd-3rd Alarm          0.0687444  0.0744012   0.924 0.355512
## inc_class_groupMedical MFAs            0.0289789  0.0150232   1.929 0.053749
## inc_class_groupNonMedical Emergencies -0.0690382  0.0064852 -10.646  < 2e-16
## inc_class_groupNonMedical MFAs        -0.0118967  0.0096138  -1.237 0.215931
## inc_class_groupNonStructural Fires    -0.0088083  0.0100519  -0.876 0.380883
## inc_class_groupStructural Fires       -0.0587682  0.0083544  -7.034 2.06e-12
## disp_resp_min_qy                       0.0002734  0.0023291   0.117 0.906562
## inc_travel_min_qy                     -0.0006612  0.0004058  -1.630 0.103202
## day_typeWeekend                       -0.0021166  0.0022814  -0.928 0.353544
## time_of_dayMorning                     0.0189708  0.0032955   5.757 8.70e-09
## time_of_dayAfternoon                   0.0104510  0.0031740   3.293 0.000994
## time_of_dayEvening                    -0.0005981  0.0033896  -0.176 0.859931
## tua_is_oneY                           -0.0246653  0.0038417  -6.420 1.39e-10
## total_assigned_units                   0.0096417  0.0008138  11.848  < 2e-16
##                                          
## (Intercept)                           ***
## inc_boroughBrooklyn                   ***
## inc_boroughManhattan                  ** 
## inc_boroughQueens                     ***
## inc_boroughStaten Island              ***
## al_source_descEMS                     *  
## al_source_descEMS-911                 ** 
## al_source_descCLASS-3                 ***
## al_source_descOthers                  *  
## al_index_descInitial Alarm            ***
## al_index_descOthers                   ***
## highest_al_levelFirst Alarm              
## highest_al_level2nd-3rd Alarm            
## inc_class_groupMedical MFAs           .  
## inc_class_groupNonMedical Emergencies ***
## inc_class_groupNonMedical MFAs           
## inc_class_groupNonStructural Fires       
## inc_class_groupStructural Fires       ***
## disp_resp_min_qy                         
## inc_travel_min_qy                        
## day_typeWeekend                          
## time_of_dayMorning                    ***
## time_of_dayAfternoon                  ***
## time_of_dayEvening                       
## tua_is_oneY                           ***
## total_assigned_units                  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1537 on 23017 degrees of freedom
## Multiple R-squared:  0.2923, Adjusted R-squared:  0.2915 
## F-statistic: 380.2 on 25 and 23017 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_em_full_014)

qqPlot(residuals(lm_em_full_014))

## [1]   686 18514

In this case the two tails look more balanced, but the conclusion is the same: the qqPlot shows that the residuals are not normally distributed.

Next, the logarithmic scale.

lm_em_full_log <- update(lm_em_full, log(emergency_min_qy) ~ .)
summary(lm_em_full_log)
## 
## Call:
## lm(formula = log(emergency_min_qy) ~ inc_borough + al_source_desc + 
##     al_index_desc + highest_al_level + inc_class_group + disp_resp_min_qy + 
##     inc_travel_min_qy + day_type + time_of_day + tua_is_one + 
##     total_assigned_units, data = emerg_min_fd.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5221 -0.4348  0.0122  0.4897  3.3584 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            1.582598   0.286900   5.516 3.50e-08 ***
## inc_boroughBrooklyn                   -0.378681   0.015159 -24.981  < 2e-16 ***
## inc_boroughManhattan                  -0.043940   0.015920  -2.760 0.005786 ** 
## inc_boroughQueens                     -0.262438   0.016599 -15.811  < 2e-16 ***
## inc_boroughStaten Island              -0.442181   0.026417 -16.738  < 2e-16 ***
## al_source_descEMS                      0.057177   0.034852   1.641 0.100903    
## al_source_descEMS-911                  0.080740   0.035280   2.289 0.022116 *  
## al_source_descCLASS-3                 -0.266010   0.017297 -15.379  < 2e-16 ***
## al_source_descOthers                   0.031943   0.035672   0.895 0.370547    
## al_index_descInitial Alarm             1.662576   0.023141  71.846  < 2e-16 ***
## al_index_descOthers                    3.093289   0.275097  11.244  < 2e-16 ***
## highest_al_levelFirst Alarm           -0.278414   0.283425  -0.982 0.325952    
## highest_al_level2nd-3rd Alarm         -0.138134   0.387242  -0.357 0.721310    
## inc_class_groupMedical MFAs            0.240270   0.078192   3.073 0.002123 ** 
## inc_class_groupNonMedical Emergencies -0.338830   0.033754 -10.038  < 2e-16 ***
## inc_class_groupNonMedical MFAs        -0.018906   0.050038  -0.378 0.705551    
## inc_class_groupNonStructural Fires    -0.076189   0.052318  -1.456 0.145330    
## inc_class_groupStructural Fires       -0.298421   0.043483  -6.863 6.92e-12 ***
## disp_resp_min_qy                       0.001249   0.012122   0.103 0.917958    
## inc_travel_min_qy                     -0.007062   0.002112  -3.344 0.000827 ***
## day_typeWeekend                       -0.009696   0.011874  -0.817 0.414182    
## time_of_dayMorning                     0.092398   0.017153   5.387 7.24e-08 ***
## time_of_dayAfternoon                   0.051043   0.016520   3.090 0.002006 ** 
## time_of_dayEvening                    -0.001901   0.017642  -0.108 0.914173    
## tua_is_oneY                           -0.132984   0.019995  -6.651 2.98e-11 ***
## total_assigned_units                   0.046169   0.004235  10.901  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7998 on 23017 degrees of freedom
## Multiple R-squared:  0.2978, Adjusted R-squared:  0.2971 
## F-statistic: 390.5 on 25 and 23017 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_em_full_log)

Again we note three clusters in the first and third plots. Let’s investigate a bit to gain some additional information.

ggplot(lm_em_full_log, aes(x = .fitted, y = .resid)) +
  geom_point(aes(color = inc_class_group)) +
  geom_hline(yintercept = 0) +
  labs(title = "Residuals VS Fitted",
       x = "Fitted Values", y = "Residuals", color = "Incident Class Group")

The right cluster is for the Structural Fires; the middle one contains both Medical and NonMedical Emergencies with some Structural and NonStructural Fires; and the left one contains NonMedical Emergencies, a small number of Medical ones, and Medical and NonMedical MFAs.

qqPlot(residuals(lm_em_full_log))

## [1]   686 18514

Here the right tail is closer to the 95% confidence envelope than in the previous model, but the left tail falls well below it. Again, the residuals are not normally distributed.

In conclusion, the linear model assumptions are not met, so we can’t use a regression model for prediction, even with the logarithmic transformation of the response. Most likely a more powerful method should be considered for this analysis, together with a deeper study of the relationships between the predictors.

4.2 Cast the analysis to a Classification task

As mentioned before, we decided, as the professor suggested, to cast our regression problem as a classification problem by splitting the range of the time-difference response into 2 classes for binary classification, or more than 2 for a multi-class task.

We will use the same responses as in the Regression section: first inc_resp_min_qy and then emergency_min_qy.

First of all, we have to decide the range of values for both of them.

summary(fire_data_new$inc_resp_min_qy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.350   4.383   5.467   5.949   6.850  58.917
summary(fire_data_new$emergency_min_qy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.05    7.15   11.93   17.08   19.58  596.38

We start with a classical binary classification in which the threshold that defines the response is the mean of the respective variable:

# threshold for inc_resp_min_qy (mean of the variable)
th_irm <- mean(fire_data_new$inc_resp_min_qy)

# threshold for emergency_min_qy (mean of the variable)
th_eme <- mean(fire_data_new$emergency_min_qy)
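
Both responses are right-skewed (in the summaries above the mean is larger than the median), so a mean threshold does not split the data in half. The sketch below uses simulated log-normal durations (resp is a hypothetical stand-in, not the project data) to show how the share of observations labelled "fast" differs between a mean and a median threshold.

```r
set.seed(42)
# Hypothetical right-skewed durations, standing in for a response time in minutes
resp <- rlnorm(10000, meanlog = 2.5, sdlog = 0.8)

th_mean <- mean(resp)
th_median <- median(resp)

# Share of observations labelled "fast" (strictly below the threshold)
share_fast_mean <- mean(resp < th_mean)
share_fast_median <- mean(resp < th_median)

c(mean_threshold = share_fast_mean, median_threshold = share_fast_median)
```

With a mean threshold the "fast" class is the larger one, which is worth keeping in mind when reading accuracy figures later: a classifier that always predicts the majority class already scores above 50%.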

4.2.1 Use the range of inc_resp_min_qy as response

# make a copy of the train and test
cl_resp_min_fd.train <- fire_data.train
cl_resp_min_fd.test <- fire_data.test

cl_resp_min_fd.train$fast_response <- cl_resp_min_fd.train$inc_resp_min_qy < th_irm
cl_resp_min_fd.test$fast_response <- cl_resp_min_fd.test$inc_resp_min_qy < th_irm

# remove the future time differences
cl_resp_min_fd.train <- cl_resp_min_fd.train %>% select(-c(disp_resp_min_qy, inc_travel_min_qy, emergency_min_qy, ticket_time, inc_resp_min_qy))
cl_resp_min_fd.test <- cl_resp_min_fd.test %>% select(-c(disp_resp_min_qy, inc_travel_min_qy, emergency_min_qy, ticket_time, inc_resp_min_qy))
glm.fit_full <- glm(fast_response ~ ., data = cl_resp_min_fd.train, family = binomial)
summary(glm.fit_full)
## 
## Call:
## glm(formula = fast_response ~ ., family = binomial, data = cl_resp_min_fd.train)
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                            0.59233    1.20858   0.490 0.624059    
## inc_boroughBrooklyn                    1.08113    0.04189  25.808  < 2e-16 ***
## inc_boroughManhattan                   0.19367    0.04249   4.558 5.15e-06 ***
## inc_boroughQueens                      0.34738    0.04421   7.857 3.92e-15 ***
## inc_boroughStaten Island               0.77875    0.07233  10.767  < 2e-16 ***
## al_source_descEMS                      0.37809    0.09838   3.843 0.000121 ***
## al_source_descEMS-911                  0.38790    0.09944   3.901 9.58e-05 ***
## al_source_descCLASS-3                 -0.24070    0.05044  -4.772 1.83e-06 ***
## al_source_descOthers                   0.68813    0.11228   6.129 8.87e-10 ***
## al_index_descInitial Alarm             0.14764    0.06012   2.456 0.014065 *  
## al_index_descOthers                   -1.84737    1.12910  -1.636 0.101810    
## highest_al_levelFirst Alarm           -1.21307    1.20160  -1.010 0.312711    
## highest_al_level2nd-3rd Alarm          4.12543   71.67297   0.058 0.954100    
## inc_class_groupMedical MFAs           -0.14191    0.20432  -0.695 0.487338    
## inc_class_groupNonMedical Emergencies -0.14312    0.09537  -1.501 0.133434    
## inc_class_groupNonMedical MFAs         0.10486    0.13903   0.754 0.450713    
## inc_class_groupNonStructural Fires     0.53125    0.15826   3.357 0.000788 ***
## inc_class_groupStructural Fires        0.28383    0.12945   2.193 0.028336 *  
## engines_assigned                       0.32005    0.02633  12.154  < 2e-16 ***
## ladders_assigned                      -0.03555    0.04025  -0.883 0.377146    
## others_units_assigned                  0.22656    0.03493   6.485 8.86e-11 ***
## day_typeWeekend                        0.14590    0.03272   4.458 8.26e-06 ***
## time_of_dayMorning                     0.38885    0.04611   8.433  < 2e-16 ***
## time_of_dayAfternoon                   0.54527    0.04462  12.222  < 2e-16 ***
## time_of_dayEvening                     0.79213    0.04833  16.389  < 2e-16 ***
## tua_is_oneY                           -0.90479    0.05733 -15.783  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 30976  on 23042  degrees of freedom
## Residual deviance: 27741  on 23017  degrees of freedom
## AIC: 27793
## 
## Number of Fisher Scoring iterations: 10
glm.probs <- predict(glm.fit_full, newdata = cl_resp_min_fd.test, type = "response")
preds50 <- glm.probs > 0.5
table(preds = preds50, true = cl_resp_min_fd.test$fast_response)
##        true
## preds   FALSE TRUE
##   FALSE  1340  803
##   TRUE   1711 3828
mean(preds50 == cl_resp_min_fd.test$fast_response)
## [1] 0.6727415
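
Accuracy alone hides how the errors are distributed between the two classes. As a small sketch, the block below rebuilds the confusion matrix from the counts printed above (the object name conf_mat is ours) and derives sensitivity, specificity and the majority-class baseline.

```r
# Confusion matrix rebuilt from the printed counts
# (rows: predicted class, columns: true class; filled column by column)
conf_mat <- matrix(c(1340, 1711, 803, 3828), nrow = 2,
                   dimnames = list(preds = c("FALSE", "TRUE"),
                                   true  = c("FALSE", "TRUE")))

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Sensitivity: fast responses correctly predicted as fast
sensitivity <- conf_mat["TRUE", "TRUE"] / sum(conf_mat[, "TRUE"])

# Specificity: slow responses correctly predicted as slow
specificity <- conf_mat["FALSE", "FALSE"] / sum(conf_mat[, "FALSE"])

# Accuracy of always predicting the most frequent true class
baseline <- max(colSums(conf_mat)) / sum(conf_mat)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, baseline = baseline), 3)
```

This gives an accuracy of 0.673, a sensitivity of 0.827, a specificity of 0.439 and a baseline of 0.603: the model recognises fast responses much better than slow ones, and the gain over always predicting TRUE is about 7 percentage points.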

The ROC curve can be computed with package pROC:

library(pROC)
glm.roc <- roc(cl_resp_min_fd.test$fast_response ~ glm.probs, plot = TRUE, print.auc = TRUE)
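
The AUC printed on the ROC plot can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it in base R with the rank (Mann-Whitney) formula on simulated scores. The helper auc_rank and the simulated data are hypothetical and do not reproduce how pROC computes the value internally.

```r
# AUC via the rank-sum formula; ties get average ranks
auc_rank <- function(score, truth) {
  n_pos <- sum(truth)
  n_neg <- sum(!truth)
  r <- rank(score)
  (sum(r[truth]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(7)
n <- 1000
truth <- rbinom(n, 1, 0.6) == 1
# Hypothetical predicted probabilities: positives tend to score higher
score <- plogis(rnorm(n, mean = ifelse(truth, 0.5, -0.5)))

auc_sim <- auc_rank(score, truth)
auc_sim
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation. Assuming the test objects defined above are available, auc_rank(glm.probs, cl_resp_min_fd.test$fast_response) should agree with the value reported by pROC.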

4.2.2 Use the range of emergency_min_qy as response